Named Entity Recognition in the Healthcare Domain

Written by Esin Ildiz

Named Entity Recognition (NER) detects the key elements or words you are interested in within texts, documents, slides and even videos. In essence, it tags keywords with the desired labels in order to extract meaningful information from a large amount of unstructured data. In this article, we focus on a NER tool that we, the Hyntelo R&D team, developed for healthcare data.

Because the amount of unstructured data in the healthcare domain is growing rapidly, being able to extract and govern medical data is increasingly important. Our tagging tool detects anatomical parts, diseases, tests, treatments, and drug and medication names in different types of documents and then categorizes them.

If you wonder how you could use these tagged documents, here are some examples:

  1. In Electronic Health Records, adverse drug reactions and their co-occurrence in patients can be analyzed. Pharmaceutical companies want to understand which drug causes which adverse reaction so that they can prevent these occurrences.
  2. Promotional materials can be tagged to help sales representatives (reps) find the most relevant information to present to healthcare professionals (HCPs).
  3. Even videos can be tagged: AI-based speech transcription extracts the text, so that diseases, drugs, tests and treatments, anatomical structures, etc. can be detected.
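
To make the first use case a little more concrete, here is a minimal sketch of counting how often a drug and an adverse reaction are tagged in the same record. The record layout and the sample values are made up for illustration and are not the actual output format of our tool. Keep in mind that co-occurrence alone does not show that a drug causes a reaction.

```python
from collections import Counter
from itertools import product

# Hypothetical NER output: one dict per patient record (illustrative data only).
records = [
    {"drugs": {"warfarin"}, "reactions": {"bleeding"}},
    {"drugs": {"warfarin", "aspirin"}, "reactions": {"bleeding", "nausea"}},
    {"drugs": {"aspirin"}, "reactions": {"nausea"}},
]


def cooccurrence(records):
    """Count, per (drug, reaction) pair, the number of records that contain both."""
    counts = Counter()
    for rec in records:
        # Every drug in a record is paired with every reaction in the same record.
        for pair in product(sorted(rec["drugs"]), sorted(rec["reactions"])):
            counts[pair] += 1
    return counts


counts = cooccurrence(records)
for (drug, reaction), n in counts.most_common():
    print(f"{drug} / {reaction}: {n} record(s)")
```

Pairs that show up together unusually often would then be candidates for closer review by domain experts.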

What makes NER special?

Imagine that you have a text file containing the names of anatomical structures. With it, you could tag anatomical structures in a given document simply by matching the text against the list. So why do we need deep learning algorithms? There are many reasons. Suppose the document you want to tag contains the phrase “on the other hand”. Simple matching will tag “hand” as an anatomical structure, even though the phrase has nothing to do with anatomy. A NER model instead takes the context of each word into account.
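
The limitation of plain matching is easy to reproduce. The sketch below is not part of our tool: the three-word list stands in for the text file of anatomical names, and `dictionary_tag` is a hypothetical helper that tags every listed word regardless of context.

```python
import re

# Hypothetical mini word list standing in for the text file of anatomical names.
ANATOMY_TERMS = {"hand", "heart", "liver"}


def dictionary_tag(text, terms=ANATOMY_TERMS):
    """Return (start, end, word) for every word found in the list, ignoring context."""
    matches = []
    for m in re.finditer(r"[A-Za-z]+", text):
        if m.group().lower() in terms:
            matches.append((m.start(), m.end(), m.group()))
    return matches


idiom = "On the other hand, the treatment was well tolerated."
clinical = "The patient reported swelling in the left hand."

# Both sentences yield a "hand" match, although only the second one is about anatomy.
print(dictionary_tag(idiom))
print(dictionary_tag(clinical))
```

The matcher cannot tell the two sentences apart, which is exactly the gap a context-aware model is meant to close.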

How did we develop our NER tool: Lyriko Autotagging?

A robust NER model needs to be trained on as much data as possible so that it learns the entities and generalizes well. The model predicts the label of an entity by leveraging the words that surround it. Moreover, it is almost impossible to add every existing anatomical structure, disease, etc. to your text file, so simple matching will miss some of them. A deep learning model, in contrast, can still predict that a name is a disease from its context even if that specific disease name never appeared in the training data. Almost like a human!

Training such a model requires a huge amount of labelled data. At Hyntelo we collected a set of so-called “corpora” specialized for the healthcare language domain. Each corpus is a set of thousands of annotated documents, where each occurrence of a named entity is highlighted and annotated with metadata (e.g. its category). This set of corpora can be further expanded to customize and fine-tune our model based on customer requirements.
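
As an illustration of what an annotated document can look like, the sketch below stores entity spans as character offsets plus a category and converts them to per-token BIO tags, a common input format for NER training. The field names, the labels and the choice of BIO are assumptions made for this example, not a description of our actual corpus schema.

```python
import re

# One annotated document: raw text plus character-offset spans with a category.
# Field names and labels are illustrative assumptions.
doc = {
    "text": "Patient with heart failure was given furosemide.",
    "entities": [
        {"start": 13, "end": 26, "label": "DISEASE"},
        {"start": 37, "end": 47, "label": "DRUG"},
    ],
}


def to_bio(doc):
    """Convert span annotations into (token, BIO tag) pairs."""
    tagged = []
    # Split into words and single punctuation marks, keeping character offsets.
    for m in re.finditer(r"\w+|[^\w\s]", doc["text"]):
        tag = "O"  # "outside" any entity by default
        for ent in doc["entities"]:
            if m.start() >= ent["start"] and m.end() <= ent["end"]:
                # First token of a span gets B- (begin), later tokens get I- (inside).
                prefix = "B" if m.start() == ent["start"] else "I"
                tag = f"{prefix}-{ent['label']}"
                break
        tagged.append((m.group(), tag))
    return tagged


for token, tag in to_bio(doc):
    print(f"{token}\t{tag}")
```

The surrounding tokens tagged `O` are kept on purpose: they are the context the model learns from.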

Our model surpasses the accuracy of other publicly available or commercial NER packages specialized in medical content (e.g. AWS Comprehend Medical).

Our auto-tagging solution also includes AI models and algorithms that preprocess or post-process the NER results, for example by correcting typos, most of which come from OCR-processed documents.
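
One simple way to picture such a correction step is fuzzy matching against a list of known terms. The sketch below uses Python's standard `difflib` module; the vocabulary, the `correct_token` helper and the 0.8 cutoff are assumptions chosen for this example, and our production pipeline is not limited to this technique.

```python
import difflib

# Hypothetical vocabulary of known medical terms.
VOCAB = ["ibuprofen", "paracetamol", "migraine", "headache"]


def correct_token(token, vocab=VOCAB, cutoff=0.8):
    """Return the closest vocabulary term, or the token unchanged if none is close enough."""
    matches = difflib.get_close_matches(token.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token


# Typical OCR confusions: "0" read instead of "o", "rn" read instead of "m".
print(correct_token("ibupr0fen"))
print(correct_token("rnigraine"))
# A word that is not close to any vocabulary term is left untouched.
print(correct_token("patient"))
```

A lower cutoff corrects more aggressively but also raises the risk of rewriting words that were already right.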

What is Entity Linking (EL)? 

Entity Linking is the task of linking a named entity found by a NER model to an item of a knowledge base in the related domain. It resolves an ambiguous textual mention to a unique identifier by comparing the context of the mention with the concepts in the knowledge base. As with NER, at Hyntelo we collected different knowledge bases developed by experts in the medical domain. They contain more than 4 million names, 1 million concepts and 12 million relations between these concepts. The concepts are linked to each other hierarchically and are updated periodically.

Thus, using our EL model, we are able to further enrich document metadata with more structured information. As an example, consider the figure below, where the migraine and headache NER entities are both standardized under the headache linked entity, which in turn belongs to the sign&symptom type. This information also lets us verify our NER labels and increase the overall confidence score.
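
The migraine and headache example can be sketched with a toy knowledge base. The concept IDs, the extra synonym and the `link_entity` helper below are made up for illustration. A real EL model also uses the context of the mention to choose between candidate concepts, whereas this sketch is a plain lookup.

```python
# Toy knowledge base. Concept IDs, names and types are made up for illustration.
CONCEPTS = {
    "C001": {"preferred_name": "headache", "type": "sign&symptom"},
    "C002": {"preferred_name": "ibuprofen", "type": "drug"},
}

# Surface names mapped to concept IDs; several names can share one concept.
NAME_TO_CONCEPT = {
    "headache": "C001",
    "migraine": "C001",
    "cephalalgia": "C001",
    "ibuprofen": "C002",
}


def link_entity(mention, ner_label=None):
    """Look up a mention and return its concept, or None if it is unknown."""
    concept_id = NAME_TO_CONCEPT.get(mention.lower())
    if concept_id is None:
        return None
    concept = CONCEPTS[concept_id]
    result = {
        "mention": mention,
        "concept_id": concept_id,
        "linked_entity": concept["preferred_name"],
        "type": concept["type"],
    }
    if ner_label is not None:
        # Agreement between the NER label and the concept type supports the NER label.
        result["ner_label_confirmed"] = ner_label == concept["type"]
    return result


print(link_entity("Migraine", ner_label="sign&symptom"))
print(link_entity("headache", ner_label="drug"))
```

When the NER label and the linked concept type disagree, as in the second call, the mention can be flagged for a lower confidence score or for review.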

Last but not least, we have full control of our auto-tagging solution. We can expand the labels, add custom entities or add certain entities to a blacklist. We can easily integrate it into any infrastructure, providing full customer support and customization.
