Table 2 Number of labels in the rare disease epidemiology dataset for NER

From: Precision information extraction for rare disease epidemiology at scale

| Labels | Train set: counts (% of labels on tokens) | Validation set: counts (% of labels on tokens) | Test set: counts (% of labels on tokens) |
|---|---|---|---|
| DIS | 5051 (48.96%) | 1019 (42.46%) | 432 (33.44%) |
| ABRV | 1808 (17.53%) | 421 (17.54%) | 272 (21.05%) |
| DATE | 660 (6.40%) | 175 (7.29%) | 96 (7.43%) |
| LOC | 764 (7.41%) | 262 (10.92%) | 118 (9.13%) |
| EPI | 747 (7.24%) | 230 (9.58%) | 116 (8.98%) |
| ETHN | 192 (1.86%) | 33 (1.38%) | 16 (1.24%) |
| SEX | 282 (2.73%) | 77 (3.21%) | 36 (2.79%) |
| STAT | 812 (7.87%) | 183 (7.63%) | 206 (15.94%) |
| Sum of all labels (% of Total Labels) | 10,316 (9.02% = 10,316/114,425) | 2,400 (7.79%) | 1,292 (9.29%) |
| Total (including O tag) | 114,425 | 30,807 | 13,909 |
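
The percentages in Table 2 appear to follow two different denominators: each per-label percentage is that label's share of all non-O labels in the split, while the "Sum of all labels" percentage is the share of labeled tokens among all tokens including the O tag (as the worked figure 10,316/114,425 suggests). The following is a minimal sketch of that arithmetic, assuming this reading of the table. The counts are transcribed from the table; the names `COUNTS`, `TOTALS_WITH_O`, `label_shares`, and `labeled_fraction` are hypothetical and not taken from the paper.

```python
# Label counts transcribed from Table 2, ordered as (train, validation, test).
COUNTS = {
    "DIS":  (5051, 1019, 432),
    "ABRV": (1808, 421, 272),
    "DATE": (660, 175, 96),
    "LOC":  (764, 262, 118),
    "EPI":  (747, 230, 116),
    "ETHN": (192, 33, 16),
    "SEX":  (282, 77, 36),
    "STAT": (812, 183, 206),
}

# Total token counts per split, including tokens tagged O.
TOTALS_WITH_O = (114425, 30807, 13909)


def label_shares(counts, split):
    """Percentage of each label among all non-O labels in one split.

    Assumption: this is how the per-label percentages are defined.
    """
    labeled = sum(c[split] for c in counts.values())
    return {label: round(100 * c[split] / labeled, 2)
            for label, c in counts.items()}


def labeled_fraction(counts, totals, split):
    """Number of non-O labels and their percentage of all tokens in one split."""
    labeled = sum(c[split] for c in counts.values())
    return labeled, round(100 * labeled / totals[split], 2)


if __name__ == "__main__":
    for split, name in enumerate(("train", "validation", "test")):
        print(name, labeled_fraction(COUNTS, TOTALS_WITH_O, split))
        print(name, label_shares(COUNTS, split))
```

Under this reading, the recomputed values can be compared directly against the table cells, which makes it easy to spot a transcription error in any count.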