Table 1 Description of eight entity classes in the manually validated dataset

From: Precision information extraction for rare disease epidemiology at scale

| Entity Class | Label | Definition | Example |
| --- | --- | --- | --- |
| Disease terms | DIS | Rare and non-rare disease names and synonyms, including those which have a unique ID or code (ICD, GARD, UMLS). Includes pathogenic diseases, but not pathogens. Does not include symptoms, features of diseases, phenotypes, or abbreviations of disease names | “Wegener's granulomatosis”, “Metachromatic leukodystrophy”, “Krabbe disease” |
| Disease abbreviations | ABRV | Abbreviations of the disease names or synonyms described above | “MPS” (Mucopolysaccharidoses), “FSHD” (Facioscapulohumeral muscular dystrophy) |
| Epidemiology Type | EPI | The epidemiologic metric being reported | “Annualized incidence”, “point prevalence”, “estimated occurrence rate” |
| Epidemiology Rate | STAT | The number of people afflicted. Usually expressed as a fraction (rate), a percentage of the (sub)population, or an integer estimation/count of persons with the disease | “Approximately 1 in 40,000 live births”, “50,000 people affected” |
| Location | LOC | Locations, including geopolitical entities, which indicate where the study took place | “North-Central Africa”, “Salla region of northern Finland”, “the United States” |
| Dates | DATE | When the study took place or when data was gathered | “Between 1985 and 2006”, “January 21, 1999” |
| Biological Sex | SEX | Terms that were likely to indicate the biological sex of the persons mentioned in the study | “Men”, “women”, “intersex” |
| Ethnicity/Nationality/Race | ETHN | Terms that are likely to indicate nationality, race, or ethnicity of the persons afflicted by the disease | “Italian”, “Ashkenazi Jew”, “Marshallese” |

  1. Detailed descriptions are listed in Additional file 3
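
The eight labels in Table 1 can be written down as a small label schema, which is a convenient form for annotation tooling. The following is a minimal sketch, not the authors' implementation: the names `EntityLabel`, `bio_tags`, and `group_by_label` are hypothetical, and the use of a BIO (begin/inside/outside) tagging scheme is an assumption that the table itself does not state.

```python
from enum import Enum


class EntityLabel(Enum):
    """The eight entity classes of Table 1 (label -> entity class name)."""
    DIS = "Disease terms"
    ABRV = "Disease abbreviations"
    EPI = "Epidemiology Type"
    STAT = "Epidemiology Rate"
    LOC = "Location"
    DATE = "Dates"
    SEX = "Biological Sex"
    ETHN = "Ethnicity/Nationality/Race"


def bio_tags():
    """Build a token-level tag set: 'O' plus B-/I- tags for each label.

    Assumption: BIO tagging is one common convention; Table 1 does not
    say which scheme was used.
    """
    tags = ["O"]
    for label in EntityLabel:
        tags.append(f"B-{label.name}")
        tags.append(f"I-{label.name}")
    return tags


def group_by_label(annotations):
    """Group (text, label) pairs by label, rejecting unknown labels."""
    grouped = {label.name: [] for label in EntityLabel}
    for text, label in annotations:
        if label not in grouped:
            raise ValueError(f"Unknown entity label: {label!r}")
        grouped[label].append(text)
    return grouped


# Example spans taken from the Example column of Table 1.
examples = [
    ("Krabbe disease", "DIS"),
    ("FSHD", "ABRV"),
    ("point prevalence", "EPI"),
    ("50,000 people affected", "STAT"),
    ("the United States", "LOC"),
]
grouped = group_by_label(examples)
```

Rejecting unknown labels in `group_by_label` keeps annotations restricted to the eight classes defined in the table.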