From: Precision information extraction for rare disease epidemiology at scale
Counts (% of labels on tokens) per dataset split:

Labels | Train set | Validation set | Test set |
---|---|---|---|
DIS | 5051 (48.96%) | 1019 (42.46%) | 432 (33.44%) |
ABRV | 1808 (17.53%) | 421 (17.54%) | 272 (21.05%) |
DATE | 660 (6.40%) | 175 (7.29%) | 96 (7.43%) |
LOC | 764 (7.41%) | 262 (10.92%) | 118 (9.13%) |
EPI | 747 (7.24%) | 230 (9.58%) | 116 (8.98%) |
ETHN | 192 (1.86%) | 33 (1.38%) | 16 (1.24%) |
SEX | 282 (2.73%) | 77 (3.21%) | 36 (2.79%) |
STAT | 812 (7.87%) | 183 (7.63%) | 206 (15.94%) |
Sum of all labels (% of total tokens, including O tag) | 10,316 (9.02% = 10,316/114,425) | 2,400 (7.79%) | 1,292 (9.29%) |
Total (including O tag) | 114,425 | 30,807 | 13,909 |
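
The two kinds of percentages in the table use different denominators: each per-label percentage is taken over the labeled (non-O) tokens of that split, while the percentage in the sum row is taken over all tokens of the split, including the O tag. The following is a minimal sketch that recomputes both from the counts above; the names `counts`, `total_tokens`, `label_share`, and `labeled_fraction` are illustrative assumptions and do not come from the original work.

```python
# Counts copied from the table: (train, validation, test) per label.
counts = {
    "DIS": (5051, 1019, 432),
    "ABRV": (1808, 421, 272),
    "DATE": (660, 175, 96),
    "LOC": (764, 262, 118),
    "EPI": (747, 230, 116),
    "ETHN": (192, 33, 16),
    "SEX": (282, 77, 36),
    "STAT": (812, 183, 206),
}
# Total tokens per split, including the O tag (last row of the table).
total_tokens = (114425, 30807, 13909)
SPLITS = ("train", "validation", "test")


def label_share(counts, split_index):
    """Percentage of each label among the labeled (non-O) tokens of one split."""
    labeled = sum(c[split_index] for c in counts.values())
    return {
        label: round(100 * c[split_index] / labeled, 2)
        for label, c in counts.items()
    }


def labeled_fraction(counts, total_tokens, split_index):
    """Number of labeled tokens and their percentage of all tokens in one split."""
    labeled = sum(c[split_index] for c in counts.values())
    return labeled, round(100 * labeled / total_tokens[split_index], 2)


if __name__ == "__main__":
    for i, split in enumerate(SPLITS):
        labeled, pct = labeled_fraction(counts, total_tokens, i)
        print(f"{split}: {labeled} labeled tokens ({pct}% of all tokens)")
        for label, share in label_share(counts, i).items():
            print(f"  {label}: {share}% of labeled tokens")
```

Running the sketch gives a quick consistency check of the table: the per-label counts of each split should add up to the "Sum of all labels" row.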