Ontology-based Fast Semantic Indexing for Structured and Unstructured Data in Health Care 7a.003.UL

Project Start Date: Aug 1, 2018
Research Areas: Data Management, Data Management - Ontologies
Funding: Member Funded

Project Summary

In the current big data environment, most data is gathered from multiple sources. Entity resolution, or the detection of duplicate data, is a major problem in this scenario. Duplication is especially pronounced in patient data from health care. Recent studies indicate that about 15% of the entries in the Master Patient Index of major hospitals are duplicates.

Heterogeneous data, incomplete information, constantly changing properties associated with entities, and temporal information all pose major challenges to identifying duplicate entities in the data. To solve this problem, we propose an indexing technique that identifies duplicate information in databases using ontology-based semantic measures. The proposed approach generates a global identifier for each entity based on the distances from the properties associated with the entity to core nodes within the semantic graph extracted from the ontology. Partial and complete match algorithms will be applied to the global identifier to identify duplicate records. The identifier can be updated when the properties associated with the entity change.

Our project proposes a proof of concept that identifies duplicate records in a Master Patient Index by indexing the data with a global patient identifier based on each patient's demographic and clinical profile. We aim to significantly outperform traditional baseline deduplication algorithms.
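The identifier scheme described above can be illustrated with a minimal Python sketch. It is not the project's implementation. The toy ontology, the choice of core nodes, the use of the minimum hop distance as each identifier component, and the 0.6 partial-match threshold are all assumptions made for illustration, as are the function names. The sketch also assumes each patient property has already been mapped to a concept node in the ontology.

```python
from collections import deque

# Hypothetical toy ontology: undirected edges between concept nodes.
ONTOLOGY_EDGES = [
    ("Patient", "Demographics"),
    ("Patient", "Clinical"),
    ("Demographics", "Name"),
    ("Demographics", "BirthDate"),
    ("Demographics", "Address"),
    ("Clinical", "Diagnosis"),
    ("Clinical", "Medication"),
    ("Diagnosis", "Diabetes"),
    ("Diagnosis", "Hypertension"),
    ("Medication", "Metformin"),
]

# Assumed core nodes; the source does not say how they are chosen.
CORE_NODES = ["Demographics", "Diagnosis", "Medication"]


def build_graph(edges):
    """Build an undirected adjacency map from a list of edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def bfs_distances(graph, source):
    """Return hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def global_identifier(graph, property_nodes, core_nodes=CORE_NODES):
    """One component per core node: the minimum hop distance
    from any of the entity's property nodes to that core node."""
    components = []
    for core in core_nodes:
        dist = bfs_distances(graph, core)
        components.append(min(dist[p] for p in property_nodes))
    return tuple(components)


def complete_match(id_a, id_b):
    """Two records match completely when their identifiers are equal."""
    return id_a == id_b


def partial_match(id_a, id_b, threshold=0.6):
    """Two records match partially when the fraction of equal
    identifier components reaches the (assumed) threshold."""
    agree = sum(1 for x, y in zip(id_a, id_b) if x == y)
    return agree / len(id_a) >= threshold


graph = build_graph(ONTOLOGY_EDGES)
record_a = {"Name", "BirthDate", "Diabetes", "Metformin"}
record_b = {"Name", "Address", "Diabetes"}  # medication not recorded

id_a = global_identifier(graph, record_a)
id_b = global_identifier(graph, record_b)

# Updating the identifier: recompute it after a property is added.
id_b_updated = global_identifier(graph, record_b | {"Metformin"})
```

In this toy setup, `record_a` and `record_b` agree on two of three components, so they match partially but not completely. Once the medication is added to `record_b`, the recomputed identifier equals that of `record_a`. A real system would need a far more discriminative identifier, since this sketch ignores property values such as the actual name or birth date.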