Medical Concept Embeddings via Labeled Background Corpora

Resources of the publication: embeddings, software and external sources

This page contains the resources used in and resulting from

Eneldo Loza Mencía, Gerard de Melo and Jinseok Nam, Medical Concept Embeddings via Labeled Background Corpora, in: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), 2016 [bibtex]


The vector representations

Example Code

A full example script for evaluating the similarity between pairs of medical concepts is provided together with the necessary text files. Copy the embeddings vector file to the expected location (data/medical_aitext). The main steps are the following.

Load the embeddings:

description_model = load_model('BioASQ_train_full_no_desc.vectors')

To look up the embedding vector of a concept, find the index of its label (see seen_label_vocabulary.txt) and use that index to look up the corresponding row of the label embeddings. Word embeddings are retrieved in the same way.
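The look-up can be sketched as follows. This is only an illustration under stated assumptions: the layout of seen_label_vocabulary.txt is not described on this page, so the sketch assumes one label per line, with the line position equal to the row index of the embedding. The helper names (read_vocabulary, lookup) and the toy data are hypothetical, and how load_model exposes its label and word matrices is not shown here.

```python
def read_vocabulary(lines):
    """Map each label to its index.

    Assumption: one label per line, and the position of a line equals
    the row index of that label's embedding vector.
    """
    return {line.rstrip("\n"): idx for idx, line in enumerate(lines)}


def lookup(label, vocabulary, embedding_rows):
    """Return the embedding row that belongs to the given label."""
    return embedding_rows[vocabulary[label]]


# Toy stand-ins for the vocabulary file and the label embedding matrix.
vocab_lines = ["Asthma\n", "Diabetes Mellitus\n", "Hypertension\n"]
label_rows = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]

vocabulary = read_vocabulary(vocab_lines)
asthma_emb = lookup("Asthma", vocabulary, label_rows)
```

With the real files, vocab_lines would be the lines read from the vocabulary file, and label_rows would be the label embedding matrix of the loaded model.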


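When you have embeddings for two concepts, you can obtain their similarity by computing the cosine similarity, i.e. the inner product of the length-normalized vectors. Compare only embeddings of the same kind: do not mix up label and word embeddings.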

from scipy import spatial
sim_desc = ((1 - spatial.distance.cosine(left_emb, right_emb)) + 1) / 2  # maps to [0, 1]
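To make explicit what this expression computes, here is a minimal sketch in plain Python without SciPy. The function names and the toy vectors are illustrative only and are not part of the example script. The cosine similarity lies in [-1, 1], and adding 1 and halving rescales it to [0, 1].

```python
import math


def cosine_similarity(u, v):
    """Inner product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rescaled_similarity(u, v):
    """Shift the cosine similarity from [-1, 1] to [0, 1]."""
    return (cosine_similarity(u, v) + 1) / 2


# Toy vectors: same direction, orthogonal, and opposite direction.
same = rescaled_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = rescaled_similarity([1.0, 0.0], [0.0, 1.0])
opposite = rescaled_similarity([1.0, 0.0], [-1.0, 0.0])
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0.5, and vectors pointing in opposite directions score 0.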

You can compute the Spearman correlation by using the built-in functions from scipy:

from scipy.stats import spearmanr
sim_rho, _ = spearmanr(target_sim_scores, label_emb_sim_scores)
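For readers who want to see what the correlation measures, the following plain-Python sketch computes Spearman's rho as the Pearson correlation of the ranks, with tied values receiving the average of their ranks. It is an illustration, not the code used for the paper; the function names and the toy scores are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy scores: human judgments versus embedding-based similarities.
human_scores = [1.0, 2.0, 3.0, 4.0]
model_scores = [0.2, 0.4, 0.6, 0.9]
rho = spearman_rho(human_scores, model_scores)
```

Because only the ordering matters, the rescaling of the cosine similarity to [0, 1] does not change the resulting correlation.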


The embeddings were learned with the software AiTextML, written by Jinseok Nam; see also the corresponding publication. The source code and installation instructions are available on the project site on GitHub.

Other Resources

Assessed Pairs of Medical Concepts

Around 500-600 pairs of medical concepts were assessed by human experts for their similarity (UMNSRS_similarity.csv) and relatedness (UMNSRS_relatedness.csv) and are available as the Medical Residents Similarity and Relatedness Set datasets. In addition, the Medical Coders Set (MayoSRS.terms) provides 101 pairs. All datasets were made available by the University of Minnesota.

Embeddings trained from PubMed

We used pretrained word embeddings, trained on abstracts and full documents from PubMed and on Wikipedia, provided by the Natural Language Processing Laboratory.

Web-Interface for computing path-based proximity

A web-interface to the UMLS::Similarity software package for obtaining similarity and relatedness measures between biomedical terms is available.

UMLS Ontology

The Unified Medical Language System ontology is available through a web interface or you can download it from the web site. However, you will need a (usually free) account. The UMLS ontology also includes a mapping to the MeSH ontology.

MeSH 2015 ontology

The concepts used are from the Medical Subject Headings (MeSH) ontology. The descriptors are available for download. Please note that we used the 2015 MeSH release in our experiments.

BioASQ background corpus

The BioASQ dataset is a subset of the PubMed database of biomedical publications and can be downloaded from the competition site (Task 3a) after registration.

Terms of Use

The data provided by the authors on this site is freely available. For external software (including AiTextML) or external data that may be included in the distributables, such as libraries or datasets, please contact the original authors for their terms of use. We would nevertheless be glad if you cited this site or our paper when you use the provided software or data.



