Anomaly detection based on frequencies of structured features is widely studied for cybersecurity, fraud detection, event detection, etc. In contrast, anomaly detection from rich-text records have not been sufficiently investigated in recent years, although it has valuable applications in some problems. The anomalies from such rich-text datasets are more meaningful for decision-making because the algorithm can provide what word occurrences in a record deviate from the normal pattern. This is particularly useful for rich-text healthcare records: in addition to insurance frauds, it can also help identify potential unmatched diagnosis, record errors, unknown drug effects, extreme medical cases, medical resource abuses, etc.
The analysis is challenging due to it is sparse and high-dimensional, and we adopt probabilistic methods to take on the challenges. Our approach to the problem is probabilistic generative models, which have been proven useful for other applications. There has been an initial research that adopts Latent Dirichlet Allocation (LDA) to establish background distribution for rich-text records in order to detect outliers that deviate from it, and it shows promising effect on medical records; however, there are still many limitations. For example, current methods need to specify the number of potential categories, and the number of sub-categories for each category, and such manual specification can be very difficult for big data. Meanwhile, we propose to consider other generative models, we can go beyond the bag-of-words assumptions of existing models, and introduce word network to include word dependencies in the analysis.
The other benefit of probabilistic approach, rather than some optimization approach, is its better support and easier adaptation for a real-time online analysis. The backgrounds need not be re-computed when new data arrives, rather, they are updated with much less computational cost. Moreover, it helps capture the potential unusual in a timely manner, and ask the doctors, drug laboratories or insurance companies to pay prompt attention; this is particular important as time is a crucial factor in the quality of healthcare.
In this project, we aim to 1) design models for detection of anomalies from rich text, and develop an algorithm that is capable of digesting hundreds of millions of records, and conduct comprehensive experiments to prove its effectiveness; 2) convert the algorithm to an online version that dynamically updates the background distribution based on new data, and experiment on its capability of real-time anomaly detection.