Named Entity Recognition (NER) lets you detect the key elements or words you are interested in across texts, documents, slides and even videos. It can be seen as tagging keywords with the desired labels in order to extract meaningful information from large amounts of unstructured data. In this article, we focus on a NER tool that we, the Hyntelo R&D team, developed for healthcare data.
Because the amount of unstructured data in the healthcare domain is growing rapidly, being able to extract and govern medical data is increasingly important. Our tagging tool detects anatomical parts, diseases, tests, treatments and medication names in documents and then categorizes them.
If you are wondering how these tagged documents can be used, here are some examples:
- If we analyze Electronic Health Records, we can study adverse drug reactions and see their co-occurrence in patients. Pharmaceutical companies are interested in understanding which drug is causing which adverse reaction to prevent these occurrences.
- We can tag each document to help sales representatives (reps) quickly find the most relevant information to present to healthcare professionals (HCPs).
- We can tag videos to detect the diseases, drugs, tests, treatments, anatomical structures, etc. that are mentioned in them.
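To make the first use case more concrete, here is a minimal sketch of how co-occurrences could be counted once records are tagged. The records, the labels (`DRUG`, `ADVERSE_REACTION`) and the function name are invented for illustration and are not the actual output format of our tool.

```python
from collections import Counter
from itertools import product

# Hypothetical tagged records: one list of (text, label) pairs per patient.
# The names and labels are invented for illustration only.
records = [
    [("drug_a", "DRUG"), ("nausea", "ADVERSE_REACTION")],
    [("drug_a", "DRUG"), ("nausea", "ADVERSE_REACTION"), ("rash", "ADVERSE_REACTION")],
    [("drug_b", "DRUG"), ("rash", "ADVERSE_REACTION")],
]

def count_cooccurrences(tagged_records):
    """Count in how many records each (drug, reaction) pair appears together."""
    counts = Counter()
    for entities in tagged_records:
        # Sets make sure a pair is counted at most once per record.
        drugs = {text for text, label in entities if label == "DRUG"}
        reactions = {text for text, label in entities if label == "ADVERSE_REACTION"}
        counts.update(product(drugs, reactions))
    return counts

cooccurrences = count_cooccurrences(records)
for (drug, reaction), n in cooccurrences.most_common():
    print(f"{drug} + {reaction}: {n}")
```

Counts like these only flag candidate pairs for further study; on their own they do not show that a drug caused a reaction.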
What makes NER special?
Imagine that you have a text file containing the names of anatomical structures. With this file, you could tag anatomical structures in a given document by simply matching its words against your list. So why do we need deep learning algorithms for that? There are several reasons. Suppose the document you want to tag contains the phrase “on the other hand”. With simple matching, you would tag “hand” as an anatomical structure, even though the phrase has nothing to do with anatomy. A NER model instead takes the contextual meaning of the text into account.
A NER model predicts the label of an entity by leveraging the words surrounding it. To make the model robust, you need to train it on as much data as possible so that it learns to generalize well. Moreover, it is almost impossible to add every existing anatomical structure, disease, etc. to your text file, so simple matching will miss some of them. A deep learning model, in contrast, can still recognize a disease name it has never seen in the training data, based on the context in which it appears. Almost like a human!
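The problem with simple matching is easy to reproduce. The sketch below assumes a tiny, made-up term list in place of the text file and tags every word that appears in it, regardless of context.

```python
import re

# Hypothetical word list standing in for the text file of anatomical structures.
ANATOMY_TERMS = {"hand", "heart", "liver"}

def naive_tag(text, terms=ANATOMY_TERMS):
    """Tag every word that appears in the term list, ignoring context."""
    return [
        (m.group(0), m.start(), "ANATOMY")
        for m in re.finditer(r"[A-Za-z]+", text)
        if m.group(0).lower() in terms
    ]

# A genuine anatomical mention.
print(naive_tag("The patient injured her hand."))
# An idiom: "hand" is still tagged, which is a false positive.
print(naive_tag("On the other hand, the treatment was effective."))
```

Both sentences produce an `ANATOMY` tag for "hand", because the matcher has no way of telling the idiom from the body part.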
How did we develop our NER tool: Lyriko Autotagging?
Training such a model requires a large amount of labelled data, and there are not many options for annotated data when the domain is restricted to healthcare. Either you create your own data with an annotation tool, which probably won’t produce enough data for training, or you take advantage of publicly available corpora that are already annotated. We used the following corpora for this project; however, we can easily expand this set based on our customers’ requirements:
- AnatEM: This corpus contains 1212 documents (approx. 250,000 words) which are manually annotated to identify over 13,000 mentions of anatomical entities.
- BC5CDR: It consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions.
- BioNLP13CG: A corpus from the BioNLP 2013 Cancer Genetics shared task, annotated with entities relevant to cancer genetics.
- NCBI Disease: The NCBI fully annotated disease corpus contains 793 PubMed abstracts, 6892 disease mentions and 790 unique disease concepts.
- i2b2-2010: De-identified patient data from clinical health records, with information on tests and treatments.
As a second step, we took advantage of the pre-trained biomedical and clinical NER models in the Stanza library, developed by the Stanford NLP Group. These models are trained on the corpora listed above and augmented with pre-trained character-level language models for improved accuracy. The language models are pre-trained on publicly available PubMed abstracts and on clinical notes from the publicly available MIMIC-III database.
It surpasses the accuracy of other available medical NER packages, namely spaCy and BioBERT.
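As an illustration, loading one of these models could look like the sketch below. It assumes Stanza is installed and uses the `mimic` package with the `i2b2` NER model; the helper names (`build_clinical_ner`, `entities_to_records`, `main`) and the sample sentence are ours and hypothetical, and this is not the exact code behind our tool.

```python
def build_clinical_ner():
    """Load a Stanza pipeline with the clinical i2b2 NER model.

    Requires `pip install stanza`; the models are downloaded on first use.
    """
    import stanza  # imported lazily so the helper below works without Stanza

    stanza.download("en", package="mimic", processors={"ner": "i2b2"})
    return stanza.Pipeline("en", package="mimic", processors={"ner": "i2b2"})

def entities_to_records(doc):
    """Convert the entities of a Stanza document into plain dictionaries."""
    return [
        {
            "text": ent.text,
            "label": ent.type,
            "start": ent.start_char,
            "end": ent.end_char,
        }
        for ent in doc.entities
    ]

def main():
    # Hypothetical input sentence, for illustration only.
    nlp = build_clinical_ner()
    doc = nlp("The patient was given aspirin after a chest X-ray.")
    for record in entities_to_records(doc):
        print(record)
```

Calling `main()` downloads the models and prints one dictionary per detected entity. Converting the entities into plain dictionaries makes it easier to combine the output of several models later on.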
As a third step, we supplemented our results with lookup tables for medications, doing simple matching with spaCy’s rule-based matching pipeline. We also created custom rules to detect specific entities that follow the same patterns.
However, the output of this NER was not as clean as we expected, so we applied a series of post-processing steps, including fixing typos, most of which came from OCRed documents. After all this cleaning, we still wanted to improve our accuracy, so we decided to make use of Named Entity Linking.
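The idea behind this step can be sketched in a few lines. Note that the example below is a plain-Python stand-in built on regular expressions, not spaCy's matcher API, and that the medication list, the `DOSAGE` rule and the function name are made up for illustration.

```python
import re

# Hypothetical lookup table; a real one would be loaded from a medication list.
MEDICATIONS = ["aspirin", "ibuprofen", "acetylsalicylic acid"]

# Longest terms first so multi-word names win over shorter alternatives.
_LOOKUP = re.compile(
    r"\b("
    + "|".join(re.escape(t) for t in sorted(MEDICATIONS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

# Hypothetical custom rule: a number followed by a unit is tagged as a dosage.
_DOSAGE = re.compile(r"\b\d+(?:\.\d+)?\s?(?:mg|ml|g)\b", re.IGNORECASE)

def rule_based_tags(text):
    """Return (text, start, end, label) tuples found by the lookup and pattern rules."""
    tags = [(m.group(0), m.start(), m.end(), "MEDICATION") for m in _LOOKUP.finditer(text)]
    tags += [(m.group(0), m.start(), m.end(), "DOSAGE") for m in _DOSAGE.finditer(text)]
    return sorted(tags, key=lambda t: t[1])

print(rule_based_tags("Patient takes Aspirin 100 mg daily and ibuprofen as needed."))
```

Lookup tables catch known names exactly, while pattern rules cover entities that share a regular shape, which is why the two complement a statistical model.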
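One possible way to fix OCR typos is fuzzy matching against a list of known entity names. The sketch below uses Python's standard `difflib` module; the vocabulary, the cutoff value and the function name are assumptions for illustration and not necessarily what our post-processing does.

```python
import difflib

# Hypothetical vocabulary of known entity names used as correction targets.
VOCABULARY = ["paracetamol", "ibuprofen", "migraine"]

def fix_ocr_typo(token, vocabulary=VOCABULARY, cutoff=0.85):
    """Replace a token with its closest vocabulary entry, if one is similar enough."""
    matches = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token

print(fix_ocr_typo("paracetam0l"))  # OCR read the letter "o" as a zero
print(fix_ocr_typo("headache"))     # no close vocabulary entry, left unchanged
```

A high cutoff keeps the correction conservative, so that tokens without a close match in the vocabulary are left as they are.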
What is Named Entity Linking (NEL)?
Named Entity Linking links an entity found by NER to a knowledge base in the relevant domain. It resolves an ambiguous textual mention to a unique identifier by comparing the context of the mention with the concepts in the knowledge base. We used the RxNorm, UMLS and HPO knowledge bases, which are developed by experts in the medical domain. They contain more than 4 million names for 1 million concepts, and 12 million relations between these concepts. The concepts are linked to each other hierarchically and are updated periodically.
Using NEL, we were also able to detect the groups and subgroups (types) that entities belong to, as well as their normalized forms (linked entities), so that we can create more structured data from the tags. As you can see from the figure below, the migraine and headache entities are standardized under the linked entity headache, and their type is sign&symptom. By taking advantage of these types, we were able to verify our NER labels and increase the confidence score.
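To illustrate how context can resolve an ambiguous mention, here is a toy sketch. The knowledge base entries, identifiers and the word-overlap scoring are all invented for this example; real linkers over knowledge bases of this size rely on far more sophisticated candidate generation and ranking.

```python
import re

# Toy knowledge base. Identifiers, names and descriptions are invented for
# illustration; they are not real RxNorm, UMLS or HPO records.
KNOWLEDGE_BASE = [
    {
        "id": "KB:0001",
        "name": "common cold",
        "aliases": {"cold"},
        "description": "viral infection with cough, sore throat and runny nose",
    },
    {
        "id": "KB:0002",
        "name": "cold temperature",
        "aliases": {"cold"},
        "description": "low temperature exposure, chill, freezing weather",
    },
]

def _words(text):
    """Lower-case a text and split it into a set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def link_entity(mention, context, kb=KNOWLEDGE_BASE):
    """Pick the candidate concept whose description shares the most words with the context."""
    candidates = [c for c in kb if mention.lower() in c["aliases"] | {c["name"]}]
    if not candidates:
        return None
    context_words = _words(context)
    return max(candidates, key=lambda c: len(context_words & _words(c["description"])))

print(link_entity("cold", "She has a cold with a cough and a runny nose")["id"])
print(link_entity("cold", "Exposure to freezing weather made him cold")["id"])
```

The same mention, "cold", is linked to a different concept in each sentence because the surrounding words overlap with different descriptions.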
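As a rough sketch of this verification idea: once a mention is linked, its type can be compared with the NER label, and the confidence adjusted when the two agree. The migraine and headache grouping follows the example above; the label-to-type mapping, the `PROBLEM` label, the boost value and the function name are assumptions made for illustration.

```python
# Toy linker output: mention -> (linked entity, type). The migraine/headache
# grouping follows the example in the text; everything else is hypothetical.
LINKS = {
    "migraine": ("headache", "sign&symptom"),
    "headache": ("headache", "sign&symptom"),
}

# Hypothetical mapping from NER labels to the types that would confirm them.
COMPATIBLE_TYPES = {"PROBLEM": {"sign&symptom", "disease"}}

def verify_with_link(entity, boost=0.1):
    """Attach the linked entity and type; raise confidence if the type agrees with the label."""
    linked = LINKS.get(entity["text"].lower())
    if linked is None:
        return {**entity, "linked_entity": None, "type": None}
    name, kb_type = linked
    agrees = kb_type in COMPATIBLE_TYPES.get(entity["label"], set())
    confidence = min(1.0, entity["confidence"] + boost) if agrees else entity["confidence"]
    return {**entity, "linked_entity": name, "type": kb_type, "confidence": confidence}

print(verify_with_link({"text": "Migraine", "label": "PROBLEM", "confidence": 0.8}))
```

Entities whose type contradicts the NER label keep their original confidence here, which makes them easy to flag for review.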
Last but not least, we have full control over the algorithm and the flexibility to change it. We can expand the labels, add custom entities or add certain entities to a blacklist. We can also easily integrate it into any infrastructure, with full customer support and customizability.