Objectives: Free-text documents such as the scientific literature contain abundant knowledge about relationships among concepts and entities. Unfortunately, this knowledge is expressed in natural language, where the different types of relationships are not explicitly categorized. In this project, we developed techniques for extracting structured knowledge from unstructured text through weak supervision over existing knowledge sources.
Methods: We used the Elsevier document corpus of about 1 million articles in the neuroscience domain as our testbed. We developed a novel relation extraction approach that integrates distant supervision with open information extraction techniques. Unlike state-of-the-art relation extraction models, which are based on supervised learning, our approach needs no manually labeled examples of relations. In addition, our model incorporates a grouping strategy that takes into account the interdependency among entities occurring in the same sentence, which has been largely ignored in previous studies.
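The core idea of distant supervision can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the knowledge base, entity lists, and sentence are all hypothetical. Sentences that mention a (gene, brain region) pair already recorded in an existing knowledge source are automatically treated as positive training examples, so no manual annotation is required.

```python
# Hypothetical knowledge base of known gene-expression relations
# (illustrative entries only).
KNOWN_PAIRS = {("Drd1", "striatum"), ("Gfap", "hippocampus")}

def label_sentence(sentence, genes, regions):
    """Auto-label every (gene, region) pair co-occurring in a sentence.

    A co-occurring pair found in the knowledge base becomes a positive
    example (label 1); other co-occurring pairs become noisy negatives
    (label 0). Real systems add filtering to reduce this label noise.
    """
    labeled = []
    for gene in genes:
        for region in regions:
            if gene in sentence and region in sentence:
                label = 1 if (gene, region) in KNOWN_PAIRS else 0
                labeled.append((gene, region, label))
    return labeled

examples = label_sentence(
    "Drd1 is strongly expressed in the striatum but not the cerebellum.",
    genes=["Drd1"],
    regions=["striatum", "cerebellum"],
)
# examples pairs Drd1 with both regions; only (Drd1, striatum) is positive.
```

Note that because both candidate pairs co-occur in the one sentence, a grouping strategy like the one described above would consider them jointly rather than classifying each pair in isolation.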
Results: We developed and implemented a distant supervision method for extracting gene expression relationships between genes and brain regions. We conducted experiments against manually annotated “gold standards.” Our experimental results show that our methods outperform the baselines.
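Evaluation against a gold standard typically reduces to set comparison between extracted and annotated relation tuples. The sketch below uses standard precision/recall/F1 definitions with illustrative relation tuples, not the paper's actual data or scores.

```python
def precision_recall_f1(predicted, gold):
    """Score a set of extracted relations against gold-standard annotations."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # true positives: relations found in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative (gene, brain region) tuples only.
gold = {("Drd1", "striatum"), ("Gfap", "hippocampus"), ("Penk", "amygdala")}
predicted = {("Drd1", "striatum"), ("Gfap", "hippocampus"), ("Gfap", "cortex")}
p, r, f = precision_recall_f1(predicted, gold)
```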
Conclusions: It is possible to develop general distant supervision approaches for relation extraction from free text. Such approaches would significantly reduce the effort of manually labeling training examples. Building on this success in the biomedical domain, we look forward to extending our techniques to a broader range of applications.