Medical Concept Embeddings via Labeled Background Corpora
This page contains the resources used in and resulting from
Medical Concept Embeddings via Labeled Background Corpora, in: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), 2016 [bibtex].
The vector representations
- BioASQ_train_full_no_desc.vectors: the label, word, and document embeddings, stored as Python objects
- MeSH_name_id_mapping_2015.txt: mapping between MeSH concept names and MeSH IDs
- seen_label_vocabulary.txt: list of medical concepts (labels) for which embeddings exist, ordered by number of occurrences
- word_vocabulary.txt: list of words for which embeddings exist, ordered by number of occurrences
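Since both vocabulary files are ordered by number of occurrences, the position of an entry can serve as its frequency rank. The following is a minimal sketch of that idea, assuming one entry per line with the most frequent entry first; this layout is an assumption and is not specified on this page. The function name `load_ranked_vocabulary` is hypothetical.

```python
from typing import Dict, Iterable


def load_ranked_vocabulary(lines: Iterable[str]) -> Dict[str, int]:
    """Map each vocabulary entry to its 0-based frequency rank.

    Assumes one entry per line, most frequent first (an assumption about
    the file layout, not something documented on this page).
    """
    ranks: Dict[str, int] = {}
    for line in lines:
        entry = line.strip()
        # Skip blank lines and keep only the first occurrence of an entry.
        if entry and entry not in ranks:
            ranks[entry] = len(ranks)
    return ranks


# In-memory stand-in for a vocabulary file; with a real file one could pass
# an open file handle instead, e.g. open("word_vocabulary.txt", encoding="utf-8").
sample_lines = ["the\n", "of\n", "\n", "patients\n"]
vocabulary = load_ranked_vocabulary(sample_lines)
print(vocabulary["patients"])  # rank among the non-empty entries
```

A rank lookup like this can be used, for example, to restrict an evaluation to the most frequent labels or words.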
Will come soon!
The embeddings were learned with the software AiTextML, written by Jinseok Nam; see also the corresponding publication. The source code and installation instructions are available on the project site on GitHub.
Assessed Pairs of Medical Concepts
Around 500-600 pairs of medical concepts were assessed by human experts regarding their similarity (UMNSRS_similarity.csv) and relatedness (UMNSRS_relatedness.csv) and made available through the Medical Residents Similarity and Relatedness Set datasets. In addition, the Medical Coders Set (MayoSRS.terms) provides 101 pairs. All datasets were made available by the University of Minnesota.
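A common way to use such assessed pairs is to compare the human scores with the cosine similarity of the corresponding embeddings and report the Spearman rank correlation. The sketch below illustrates this in plain Python; it is not necessarily the exact procedure used in the paper. The function names, the toy terms, and the toy vectors are hypothetical, and parsing of the CSV files is left out because their column layout is not described here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; NaN if either input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        return float("nan")
    return cov / (sd_x * sd_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(pairs: Sequence[Tuple[str, str, float]],
             embeddings: Dict[str, Sequence[float]]) -> Tuple[float, int]:
    """Return (Spearman correlation, number of pairs covered by the embeddings)."""
    human, model = [], []
    for term_a, term_b, score in pairs:
        # Pairs with a term that has no embedding are skipped.
        if term_a in embeddings and term_b in embeddings:
            human.append(score)
            model.append(cosine(embeddings[term_a], embeddings[term_b]))
    return spearman(human, model), len(human)


# Toy data (hypothetical terms, vectors, and scores; not taken from the datasets).
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
toy_pairs = [("a", "b", 4.0), ("a", "c", 1.0), ("b", "c", 1.5), ("a", "missing", 3.0)]
correlation, covered = evaluate(toy_pairs, toy_embeddings)
print(correlation, covered)
```

Reporting the number of covered pairs alongside the correlation matters because not every concept in the assessed pairs necessarily has an embedding.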
Embeddings trained from PubMed
Web-Interface for computing path-based proximity
A web-interface to the UMLS::Similarity software package for obtaining similarity and relatedness measures between biomedical terms is available.
The Unified Medical Language System ontology is available through a web interface or you can download it from the web site. However, you will need a (usually free) account. The UMLS ontology also includes a mapping to the MeSH ontology.
MeSH 2015 ontology
The concepts used are from the Medical Subject Headings (MeSH) ontology. You can download the descriptors from http://www.nlm.nih.gov/mesh/filelist.html. Please note that we used the 2015 version of MeSH in our experiments.
BioASQ background corpus