Nature/Science Paper Reading Notes

Unsupervised word embeddings capture latent knowledge from materials science literature

The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods.

By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases, which encompass only a small fraction of the knowledge present in the research literature.

Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors.

To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing, which requires large hand-labelled datasets for training.

Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision.

Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials.

Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery.

This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications.

Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
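To make "word embeddings without human labelling" concrete, here is a minimal sketch. The abstract above does not name the training algorithm, so this is not the authors' model: it uses raw co-occurrence counts as a simple stand-in for learned vectors. The toy corpus, the `window` size, and the function names are all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus, invented purely for illustration (not real data).
corpus = [
    "bi2te3 is a promising thermoelectric material",
    "pbte is a promising thermoelectric material",
    "licoo2 is a common cathode material",
    "lifepo4 is a common cathode material",
]


def cooccurrence_vectors(sentences, window=4):
    """Map each word to a Counter of the words seen within `window` positions.

    No labels are used: the only signal is which words appear near each other.
    """
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = cooccurrence_vectors(corpus)
sim_same = cosine(vectors["bi2te3"], vectors["pbte"])    # same application context
sim_diff = cosine(vectors["bi2te3"], vectors["licoo2"])  # different application context
print(f"bi2te3 vs pbte:   {sim_same:.2f}")
print(f"bi2te3 vs licoo2: {sim_diff:.2f}")
```

In this toy corpus the two compounds that appear in "thermoelectric" sentences end up with more similar vectors than a thermoelectric/cathode pair, even though nobody told the program what either word means. Real embeddings are dense, low-dimensional, and trained on a far larger corpus, but the underlying idea (words are characterised by their contexts) is the same.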

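Unsupervised Word Embeddings Capture Latent Knowledge from Materials Science Literature (translation)

The vast majority of scientific knowledge is published as text, which is hard to analyse with either traditional statistical analysis or modern machine learning methods.

By contrast, the main source of machine-interpretable data for the materials research community is structured property databases, which contain only a small part of the knowledge in the research literature.

Besides property values, publications also contain valuable knowledge about the connections and relationships between data items, as interpreted by the authors.

To better identify and use this knowledge, some studies have focused on retrieving information from the scientific literature with supervised natural language processing, which needs large hand-labelled datasets for training.

Here we show that, without human labelling or supervision, the materials science knowledge in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words).

Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts, such as the underlying structure of the periodic table and structure-property relationships in materials.

In addition, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery.

This suggests that latent knowledge about future discoveries is, to a large extent, embedded in past publications.

Our findings highlight the possibility of extracting knowledge and relationships from the large body of scientific literature in a collective manner, and point towards a general approach to mining the scientific literature.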
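The claim that embeddings can "recommend materials for functional applications" can be pictured as a ranking problem. The abstract does not say how the recommendations are produced, so the following is only a sketch under one assumption: candidates are ranked by cosine similarity between their vector and the vector of an application word. The three-dimensional vectors and the names `material_a`, `material_b`, `material_c` are hypothetical placeholders, not values from the paper.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-written for illustration only.
# Real embeddings would be learned from a literature corpus.
embeddings = {
    "thermoelectric": [0.9, 0.1, 0.0],
    "material_a": [0.8, 0.2, 0.1],
    "material_b": [0.1, 0.9, 0.2],
    "material_c": [0.6, 0.3, 0.4],
}


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(query, candidates, table):
    """Return (name, score) pairs sorted from most to least similar to `query`."""
    query_vec = table[query]
    scored = [(name, cosine(query_vec, table[name])) for name in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_candidates(
    "thermoelectric",
    ["material_a", "material_b", "material_c"],
    embeddings,
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

To test the "several years before their discovery" part, one would presumably build the embeddings only from publications up to some cutoff year and then check whether highly ranked candidates were reported for that application later. The sketch above does not model that step.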
