Named Entity Recognition#

This page examines the performance of competing models for Danish named entity recognition across multiple datasets. Performance here is not limited to accuracy; we also examine domain generalization, biases, and robustness. This page is also a notebook, which can be opened and run to replicate the results.

State-of-the-Art comparison#

To our knowledge, there exist three datasets for Danish named entity recognition:

  1. DaNE [Hvingelby et al., 2020], which uses the simple annotation scheme of CoNLL 2003 [Sang and De Meulder, 2003] with the entities person, location, organization, and miscellaneous.

  2. DANSK, which uses an extensive annotation scheme similar to that of OntoNotes 5.0 [Weischedel et al., 2013], including more than 16 entity types.

  3. and DAN+ [Plank et al., 2021], which also uses the annotation scheme of CoNLL 2003 but allows for nested entities, for instance Aarhus Universitet, where Aarhus is a location and Aarhus Universitet is an organization.

In this comparison we will be examining performance on DaNE and DANSK, but as no known models have been trained on Danish nested entities, we will not be comparing performance on DAN+.

Measuring Performance

Typically, when measuring performance on these benchmarks, it is normal to feed the model the gold-standard tokens. While this allows for easier comparisons of modules and architectures, it inflates the performance metrics. Further, it does not properly reflect what you are really interested in: the performance you can expect when you apply the model to data of a similar type. Therefore, in this comparison the model is given no prior knowledge of the data, and only the raw text is fed to the model. Thus the performance metrics might differ slightly from those reported by e.g. DaNLP.
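
As a rough illustration of this setup, the sketch below evaluates a model on raw text with spaCy's scoring utilities. It is not the exact benchmark script: the model name and the path `dane_test.spacy` are placeholders, and `dacy.load` is assumed to accept the full model name.

```python
import dacy
from spacy.scorer import Scorer
from spacy.tokens import DocBin
from spacy.training import Example

# Placeholder model and path; any serialized DocBin with gold entity
# annotations would work here.
nlp = dacy.load("da_dacy_small_trf-0.2.0")
gold_docs = list(DocBin().from_disk("dane_test.spacy").get_docs(nlp.vocab))

# Predictions are made from the raw text only, so the model also has to handle
# tokenization itself (no gold-standard tokens are supplied).
examples = [Example(nlp(doc.text), doc) for doc in gold_docs]
print(Scorer().score(examples)["ents_f"])
```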

DaNE: Simple Named Entity Recognition#

As already stated, DaNE uses the annotation scheme of CoNLL 2003, which defines the following entity types [Hvingelby et al., 2020]:

| Entity | Description |
| --- | --- |
| LOC | Includes locations like cities, roads and mountains, as well as both public and commercial places like specific buildings or meeting points, but also abstract places. |
| PERSON | Consists of names of people, fictional characters, and animals. The names include aliases. |
| ORG | Can be summarized as all sorts of organizations and collections of people, ranging from companies, brands and political movements to governmental bodies and clubs. |
| MISC | A broad category of e.g. events, languages, titles and religions, but this tag also includes words derived from one of the four tags as well as words of which one part comes from one of the three other tags. |

Here is an example from the dataset:

To kendte russiske MISC historikere Andronik Mirganjan PERSON og Igor Klamkin PERSON tror ikke, at Rusland LOC kan udvikles uden en "jernnæve".
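
To make the scheme concrete, the gold spans of this example can be rebuilt as a spaCy Doc and rendered with displacy. This is purely illustrative; the character offsets are hand-computed for this particular sentence.

```python
import spacy
from spacy import displacy

nlp = spacy.blank("da")
doc = nlp(
    "To kendte russiske historikere Andronik Mirganjan og Igor Klamkin tror ikke, "
    'at Rusland kan udvikles uden en "jernnæve".'
)
# Attach the gold entities from the example above (offsets are sentence-specific).
doc.ents = [
    doc.char_span(10, 18, label="MISC"),    # russiske
    doc.char_span(31, 49, label="PERSON"),  # Andronik Mirganjan
    doc.char_span(53, 65, label="PERSON"),  # Igor Klamkin
    doc.char_span(80, 87, label="LOC"),     # Rusland
]
displacy.render(doc, style="ent")  # use displacy.serve(...) outside a notebook
```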

The table below shows the performance of Danish language processing pipelines scored on the DaNE test set. The best score in each category is highlighted in bold and the second best in italics.

F1 score with 95% confidence interval calculated using bootstrapping with 500 samples.

| Models | Average | Location | Person | Organization | Misc. |
| --- | --- | --- | --- | --- | --- |
| da_dacy_large_trf-0.2.0 | *85.4 (81.2, 88.9)* | **89.5 (84.0, 94.7)** | 92.6 (89.0, 95.4) | *79.0 (72.5, 84.6)* | **79.0 (70.8, 86.0)** |
| da_dacy_medium_trf-0.2.0 | 84.9 (81.0, 88.5) | 86.8 (81.2, 92.3) | *92.7 (89.2, 95.6)* | 78.7 (71.8, 85.0) | *78.7 (70.6, 86.1)* |
| da_dacy_small_trf-0.2.0 | 82.7 (79.3, 85.9) | 84.2 (78.3, 89.8) | 92.2 (88.5, 95.1) | 75.9 (69.3, 81.7) | 75.7 (68.8, 81.8) |
| saattrupdan/nbailab-base-ner-scandi | **86.3 (82.4, 89.7)** | *88.6 (83.0, 93.3)* | **95.1 (92.4, 97.8)** | **80.3 (73.6, 85.8)** | 78.6 (69.4, 86.0) |
| alexandrainst/da-ner-base | 70.7 (66.2, 75.2) | 84.8 (77.8, 91.0) | 90.3 (86.3, 93.9) | 64.7 (57.0, 71.3) | |
| da_core_news_trf-3.5.0 | 79.0 (75.1, 82.3) | 82.1 (75.5, 88.5) | 91.6 (88.2, 94.5) | 68.0 (61.0, 75.2) | 69.0 (61.1, 77.3) |
| da_core_news_lg-3.5.0 | 74.6 (70.8, 78.1) | 81.6 (75.3, 88.2) | 85.5 (81.1, 89.9) | 62.7 (54.8, 70.3) | 64.4 (55.9, 72.8) |
| da_core_news_md-3.5.0 | 71.2 (66.9, 75.2) | 76.8 (69.9, 83.6) | 82.6 (77.8, 87.0) | 58.2 (49.6, 66.7) | 61.8 (52.6, 70.6) |
| da_core_news_sm-3.5.0 | 64.4 (59.7, 68.5) | 61.6 (52.2, 69.9) | 80.1 (74.9, 85.1) | 49.0 (39.0, 57.5) | 58.4 (49.8, 67.1) |
| openai/gpt-3.5-turbo (02/05/23) | 57.5 (52.3, 62.2) | 50.7 (41.9, 59.2) | 81.9 (76.8, 86.5) | 55.7 (47.1, 63.7) | |
| openai/gpt-4 (02/05/23) | 70.1 (66.0, 74.3) | 78.9 (71.5, 85.7) | 85.3 (80.4, 89.5) | 72.0 (65.4, 78.5) | |

It is worth mentioning that while da_dacy_large_trf-0.2.0 and saattrupdan/nbailab-base-ner-scandi perform similarly, they each have their own strengths and weaknesses. The large DaCy model is a multi-task model that performs named entity recognition as only one of its many tasks, so if you also need those other tasks we would recommend it. On the other hand, nbailab-base-ner-scandi is trained on multiple Scandinavian languages and might thus be ideal if your dataset contains these languages as well. saattrupdan/nbailab-base-ner-scandi is available in DaCy using nlp.add_pipe("dacy/ner").
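
For instance, it can be added to a blank Danish pipeline as sketched below. The factory name is taken from the note above; importing dacy is assumed to register it.

```python
import spacy
import dacy  # assumed to register the "dacy/ner" factory

nlp = spacy.blank("da")
nlp.add_pipe("dacy/ner")  # adds saattrupdan/nbailab-base-ner-scandi

doc = nlp("Andronik Mirganjan og Igor Klamkin tror ikke, at Rusland kan udvikles uden en 'jernnæve'.")
print([(ent.text, ent.label_) for ent in doc.ents])
```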

You are missing a model

These tables are continually updated, and we try to limit them to the most relevant Danish models. Models such as Polyglot, which have strict requirements and consistently worse performance, are therefore excluded. If you want to see a specific model, please open an issue on GitHub.

DANSK: Fine-grained Named Entity Recognition#

DANSK is annotated from the Danish Gigaword Corpus [Derczynski et al., 2021] and covers a wide variety of domains, including conversational, legal, news, social media, web content, wikis, and books. DANSK includes the following labels:

| Entity | Description |
| --- | --- |
| PERSON | People, including fictional |
| NORP | Nationalities or religious or political groups |
| FACILITY | Buildings, airports, highways, bridges, etc. |
| ORGANIZATION | Companies, agencies, institutions, etc. |
| GPE | Countries, cities, states |
| LOCATION | Non-GPE locations, mountain ranges, bodies of water |
| PRODUCT | Vehicles, weapons, foods, etc. (not services) |
| EVENT | Named hurricanes, battles, wars, sports events, etc. |
| WORK OF ART | Titles of books, songs, etc. |
| LAW | Named documents made into laws |
| LANGUAGE | Any named language |

DANSK also includes annotations for the following concepts:

| Entity | Description |
| --- | --- |
| DATE | Absolute or relative dates or periods |
| TIME | Times smaller than a day |
| PERCENT | Percentage, including “%” |
| MONEY | Monetary values, including unit |
| QUANTITY | Measurements, as of weight or distance |
| ORDINAL | “first”, “second”, etc. |
| CARDINAL | Numerals that do not fall under another type |

With this many labels a table quickly becomes unwieldy, so we have here opted for an interactive chart rather than a table. The chart is interactive: you can select the label you want to compare the models on, and you can hover over the dots to see the exact values.

F1 score with 95% confidence interval calculated using bootstrapping with 100 samples. The columns are the fine-grained DaCy models (large, medium, and small, version 0.1.0).

| | Label | Large 0.1.0 | Medium 0.1.0 | Small 0.1.0 |
| --- | --- | --- | --- | --- |
| Entities | Event | 43.5 (27.0, 56.0) | 64.2 (50.0, 79.4) | 46.1 (27.8, 62.4) |
| | Facility | 69.8 (54.3, 84.4) | 72.3 (56.2, 84.6) | 55.5 (36.2, 70.5) |
| | GPE | 90.6 (87.2, 93.1) | 88.0 (82.7, 92.1) | 79.6 (73.0, 84.6) |
| | Language | 74.5 (60.0, 83.3) | 51.9 (23.3, 100.0) | 45.9 (13.3, 93.3) |
| | Law | 54.2 (38.1, 72.5) | 59.3 (37.4, 77.3) | 57.6 (39.6, 75.1) |
| | Location | 75.3 (66.9, 83.8) | 72.5 (62.1, 80.8) | 65.6 (55.4, 74.1) |
| | NORP | 84.8 (76.9, 90.8) | 78.2 (68.6, 85.8) | 73.3 (62.9, 81.5) |
| | Ordinal | 37.8 (22.5, 51.2) | 68.7 (49.1, 82.6) | 68.5 (47.6, 83.1) |
| | Organization | 79.5 (74.9, 83.1) | 80.5 (78.1, 84.2) | 79.1 (75.7, 82.3) |
| | Person | 85.9 (82.7, 88.8) | 84.8 (80.6, 88.2) | 86.8 (83.2, 90.1) |
| | Product | 62.4 (53.9, 72.0) | 62.6 (53.9, 71.6) | 59.5 (48.9, 67.9) |
| | Work of Art | 39.3 (25.5, 50.3) | 58.4 (48.7, 69.1) | 46.6 (36.2, 56.9) |
| Non-Entities | Cardinal | 87.0 (82.8, 90.3) | 80.5 (77.0, 84.4) | 89.2 (86.0, 91.7) |
| | Date | 77.3 (71.6, 81.8) | 77.6 (72.8, 82.2) | 78.8 (73.9, 83.4) |
| | Money | 99.3 (97.9, 100.0) | 98.6 (97.2, 100.0) | 95.2 (90.0, 98.2) |
| | Percent | 100.0 (100.0, 100.0) | 100.0 (100.0, 100.0) | 100.0 (100.0, 100.0) |
| | Quantity | 78.6 (59.8, 93.8) | 76.9 (63.9, 89.9) | 71.3 (50.0, 91.1) |
| | Time | 90.9 (83.8, 96.7) | 85.1 (74.0, 93.7) | 83.4 (68.0, 95.6) |
| Average | Average | 80.1 (78.2, 81.9) | 79.7 (77.7, 81.5) | 78.4 (76.3, 80.4) |
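
To obtain predictions with these fine-grained labels, the models can be loaded directly in DaCy. A minimal sketch, assuming the small 0.1.0 fine-grained model (the medium and large variants load the same way):

```python
import dacy

nlp = dacy.load("da_dacy_small_ner_fine_grained-0.1.0")

doc = nlp("Jeg betalte 300 kr. for to billetter til koncerten i København den 3. maj.")
for ent in doc.ents:
    # Labels follow the DANSK scheme above (e.g. MONEY, GPE, DATE, ...).
    print(ent.text, ent.label_)
```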

Domain Generalization#

For the domain generalization benchmark we utilize the DANSK dataset. This dataset is annotated across many different domains, including fiction, web content, social media, wikis, news, legal and conversational data. As some models are trained on DANSK (da_dacy_{size}_ner_fine_grained-{version}), these models are tested on the test set using all of the labels.

Domain generalization using CoNLL-2003 format#

To test generalization we here convert the annotations to the CoNLL-2003 format using the labels Person, Location, and Organization. As in CoNLL-2003, Location includes cities, roads, mountains, abstract places, specific buildings, and meeting points; thus GPE (geo-political entity) annotations were converted to Location. The MISC category in CoNLL-2003 is a diverse category meant to denote all names not in the other categories (encapsulating e.g. both events and adjectives such as ”2004 World Cup” and ”Italian”) and is therefore not included.
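
The sketch below shows our assumed version of this label conversion; it is not the exact script used for the benchmark.

```python
# GPE is folded into Location; labels outside the CoNLL-2003 scheme are dropped.
CONLL_MAP = {
    "PERSON": "PER",
    "ORGANIZATION": "ORG",
    "LOCATION": "LOC",
    "GPE": "LOC",
}

def to_conll(entities):
    """Map (text, label) pairs to the CoNLL-2003 scheme, dropping the rest."""
    return [(text, CONLL_MAP[label]) for text, label in entities if label in CONLL_MAP]

print(to_conll([("Danmark", "GPE"), ("FN", "ORGANIZATION"), ("2004 World Cup", "EVENT")]))
# [('Danmark', 'LOC'), ('FN', 'ORG')]
```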

Biases#

To examine the biases in Danish models we use augmentation to replace names in the Danish dataset DaNE [Hvingelby et al., 2020]; this approach is similar to that introduced in the initial DaCy paper [Enevoldsen et al., 2021].

Here is a short example of what the augmentation might look like:

Example

Original

Peter Schmeichel mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.

Female name augmentation

Anne Østergaard mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.
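
Conceptually, the augmentation swaps each PERSON span for a name drawn from a target name list. Below is a simplified, library-free sketch of that idea; the benchmark itself uses the augmenty package, and the name list here is purely illustrative.

```python
import random

FEMALE_NAMES = ["Anne Østergaard", "Mette Jensen", "Ida Nielsen"]  # illustrative

def replace_person_spans(text, person_spans, names, seed=0):
    """Replace each PERSON span, given as (start, end) character offsets."""
    rng = random.Random(seed)
    out, last = [], 0
    for start, end in sorted(person_spans):
        out.append(text[last:start])
        out.append(rng.choice(names))
        last = end
    out.append(text[last:])
    return "".join(out)

text = "Peter Schmeichel mener også, at det danske landshold anno 2021 tilhører verdenstoppen."
print(replace_person_spans(text, [(0, 16)], FEMALE_NAMES))
```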

F1 score for each augmentation with 95% confidence interval calculated over 100 repetitions.

| Models | Baseline | Danish Names | Female Names | Male Names | Muslim Names |
| --- | --- | --- | --- | --- | --- |
| saattrupdan/nbailab-base-ner-scandi | 86.3 (82.4, 89.7) | 89.0 (86.8, 91.1) | 88.9 (86.9, 91.1) | 88.9 (86.9, 91.1) | 88.1 (85.9, 90.4) |
| da_dacy_large_trf-0.2.0 | 85.4 (81.2, 88.9) | 87.7 (85.2, 90.4) | 87.8 (85.2, 90.2) | 87.5 (84.3, 90.3) | 85.6 (82.9, 88.3) |
| da_dacy_medium_trf-0.2.0 | 84.9 (81.0, 88.5) | 86.2 (83.9, 88.8) | 86.1 (83.8, 89.1) | 86.1 (83.6, 89.2) | 84.2 (81.7, 87.4) |
| da_dacy_small_trf-0.2.0 | 82.7 (79.3, 85.9) | 82.4 (79.6, 85.3) | 82.2 (79.9, 84.7) | 82.1 (79.2, 85.2) | 81.2 (78.6, 83.7) |
| alexandrainst/da-ner-base | 70.7 (66.2, 75.2) | 81.5 (78.2, 84.4) | 81.6 (78.3, 84.4) | 81.5 (78.2, 84.4) | 79.8 (76.7, 82.4) |
| da_core_news_trf-3.5.0 | 79.0 (75.1, 82.3) | 80.7 (77.2, 83.1) | 80.9 (78.1, 83.8) | 80.6 (77.3, 83.8) | 78.7 (75.8, 81.1) |
| da_core_news_lg-3.5.0 | 74.6 (70.8, 78.1) | 78.3 (75.5, 80.7) | 78.5 (75.9, 81.1) | 78.4 (75.4, 81.2) | 68.2 (65.4, 71.2) |
| da_core_news_md-3.5.0 | 71.2 (66.9, 75.2) | 75.7 (71.9, 78.7) | 75.6 (72.2, 79.1) | 75.5 (72.3, 78.9) | 64.6 (60.5, 68.1) |
| da_core_news_sm-3.5.0 | 64.4 (59.7, 68.5) | 58.8 (55.5, 62.0) | 59.1 (56.2, 62.6) | 59.1 (56.4, 62.3) | 53.4 (50.2, 56.4) |

Robustness#

In the paper ‘DaCy: A Unified Framework for Danish NLP’ [Enevoldsen et al., 2021] we conducted a series of augmentations on the DaNE test set to estimate the robustness and biases of DaCy and other Danish language processing pipelines. This page presents only parts of the paper; we recommend reading the paper for a more thorough and nuanced overview.

The augmentations used in this test are performed on the DaNE test set and include the following:

  • Spelling Error: Intended to simulate domains with inconsistent spelling, OCR errors, conversational data, etc. The augmentation consists of a series of smaller augmentations:

    • Keystroke error: The augmentation is used to introduce errors by replacing a character with a character that is close on the keyboard.

    • Character swap: The augmentation is used to introduce errors by swapping two neighboring characters.

    • Token swap: The augmentation is used to introduce errors by swapping two neighboring tokens.

  • Inconsistent Casing: This augmentation is used to simulate inconsistent casing in the language and uses two different methods by either randomly capitalizing or lowercasing tokens.

  • Synonym Augmentation: This augmentation is used to simulate the variation and slight grammatical errors in the language and uses two different methods:

    • Wordnet Synonym replacement: The augmentation replaces a token with a synonym in WordNet while respecting its syntactic role.

    • Embedding Synonym replacement: This augmentation replaces a token with a synonym which tends to appear in similar contexts.

  • Inconsistent Spacing: This augmentation is used to simulate inconsistent spacing in the language and uses two different methods by either randomly adding or removing spaces.

  • Historical Spelling: This augmentation is used to simulate historical spelling in Danish including ASCII spellings of the letters Æ (Ae), Ø (Oe), and Å (Aa) as well as uppercasing nouns.

For all of the augmentations, the augmentation probability is set such that 5% of the positions where the targeted augmentation can take place are augmented. The augmentations are performed using the augmenty package.

The underlying assumption behind these augmentations is that the annotations of the tokens do not change with the augmentation. This naturally does not always hold: a single letter separates “hun læste gåden” (she read the puzzle) from “hun løste gåden” (she solved the puzzle), and the two have quite different meanings. So while we expect performance to drop, the degree of the drop is what is interesting to examine, often in comparison to the other models.
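
To try one of these augmentations yourself, augmenty can be used roughly as sketched below. The augmenter name "keystroke_error_v1" and the keyboard "da_qwerty_v1" are assumptions about the augmenty registry and may differ between versions.

```python
import augmenty
import spacy

nlp = spacy.blank("da")
# 5% of eligible characters are replaced by a neighbouring key (assumed augmenter).
keystroke_aug = augmenty.load("keystroke_error_v1", level=0.05, keyboard="da_qwerty_v1")

texts = ["Peter Schmeichel mener også, at det danske landshold tilhører verdenstoppen."]
for augmented in augmenty.texts(texts, augmenter=keystroke_aug, nlp=nlp):
    print(augmented)
```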

F1 score for each augmentation with 95% confidence interval calculated over 100 repetitions.

| Models | Baseline | Historical Spelling | Inconsistent Casing | Inconsistent Spacing | Spelling Error | Synonym replacement |
| --- | --- | --- | --- | --- | --- | --- |
| saattrupdan/nbailab-base-ner-scandi | 86.3 (82.4, 89.7) | 81.9 (79.1, 85.0) | 86.5 (84.4, 89.0) | 78.8 (75.7, 81.6) | 73.3 (69.9, 76.8) | 87.1 (84.9, 89.6) |
| da_dacy_large_trf-0.2.0 | 85.4 (81.2, 88.9) | 86.0 (82.8, 88.9) | 86.9 (83.9, 89.4) | 69.7 (66.4, 72.4) | 59.7 (56.4, 63.9) | 85.9 (82.9, 88.8) |
| da_dacy_medium_trf-0.2.0 | 84.9 (81.0, 88.5) | 69.6 (66.7, 72.1) | 83.7 (81.3, 86.3) | 70.5 (66.6, 74.0) | 65.4 (62.6, 68.5) | 85.1 (82.5, 88.3) |
| da_dacy_small_trf-0.2.0 | 82.7 (79.3, 85.9) | 51.7 (49.1, 54.6) | 81.1 (78.6, 83.5) | 64.3 (60.4, 67.2) | 63.1 (59.9, 66.5) | 83.4 (81.0, 85.7) |
| alexandrainst/da-ner-base | 70.7 (66.2, 75.2) | 78.7 (75.3, 81.6) | 80.8 (77.6, 83.2) | 63.4 (59.4, 66.3) | 49.9 (47.3, 53.6) | 80.1 (77.1, 82.8) |
| da_core_news_trf-3.5.0 | 79.0 (75.1, 82.3) | 75.1 (72.4, 77.3) | 81.3 (78.5, 84.1) | 58.9 (55.8, 62.3) | 41.2 (38.5, 44.0) | 80.4 (77.6, 83.3) |
| da_core_news_lg-3.5.0 | 74.6 (70.8, 78.1) | 47.0 (44.5, 49.7) | 74.5 (71.6, 77.7) | 51.1 (48.1, 53.8) | 44.9 (42.0, 47.9) | 76.3 (73.6, 79.1) |
| da_core_news_md-3.5.0 | 71.2 (66.9, 75.2) | 48.7 (45.7, 51.6) | 71.6 (68.2, 75.4) | 51.1 (47.6, 54.3) | 41.8 (38.8, 44.7) | 72.8 (69.2, 76.1) |
| da_core_news_sm-3.5.0 | 64.4 (59.7, 68.5) | 31.9 (29.6, 34.1) | 61.5 (58.1, 64.6) | 46.6 (43.7, 50.4) | 49.6 (46.5, 53.0) | 64.8 (61.4, 68.1) |

Inference Speed#

While performance naturally is important, it is also worth knowing why you might choose one model over another. One of the main reasons for choosing a smaller model is inference speed. The following table shows the inference speed of the different models, measured in words per second (WPS) on an Apple M1 Pro 16GB running macOS 13.3.1 (i.e. a high-end consumer laptop). The models are tested on the test set of DaNE.

GPU Acceleration

These benchmarks do not use GPU acceleration. With GPU acceleration the inference speed would be much higher, and larger models would benefit the most from it.

Inference speed on the DaNE test set.

| Model | Words per second | Total time (sec) |
| --- | --- | --- |
| saattrupdan/nbailab-base-ner-scandi | 1438.8 | 6.97 |
| da_dacy_large_trf-0.2.0 | 353.3 | 28.37 |
| da_dacy_medium_trf-0.2.0 | 770.2 | 13.01 |
| da_dacy_small_trf-0.2.0 | 2024.6 | 4.95 |
| da_dacy_large_ner_fine_grained-0.1.0 | 567.9 | 17.65 |
| da_dacy_medium_ner_fine_grained-0.1.0 | 1670.3 | 6.00 |
| da_dacy_small_ner_fine_grained-0.1.0 | 5717.6 | 1.75 |
| alexandrainst/da-ner-base | 1618.7 | 6.19 |
| da_core_news_trf-3.5.0 | 1125.1 | 8.91 |
| da_core_news_lg-3.5.0 | 31364.7 | 0.32 |
| da_core_news_md-3.5.0 | 32571.3 | 0.31 |
| da_core_news_sm-3.5.0 | 34624.4 | 0.29 |

Note here that the da_dacy_{size}_trf-{version} models from DaCy and the da_core_news_{size}-{version} models from spaCy are multi-task models, so they perform multiple tasks at once. This means that their inference speed is not directly comparable to that of the single-task models.
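
For reference, words per second can be estimated along the following lines. This is a rough sketch rather than the exact benchmark script, and the model name is just an example.

```python
import time
import dacy

nlp = dacy.load("da_dacy_small_trf-0.2.0")
texts = ["To kendte russiske historikere tror ikke, at Rusland kan udvikles uden en 'jernnæve'."] * 100

# Time the full pipeline over raw texts and divide the token count by wall-clock time.
start = time.perf_counter()
docs = list(nlp.pipe(texts))
elapsed = time.perf_counter() - start

n_words = sum(len(doc) for doc in docs)
print(f"{n_words / elapsed:.1f} words per second ({elapsed:.2f} s in total)")
```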

References#

1. Leon Derczynski, Manuel R. Ciosici, Rebekah Baglini, Morten H. Christiansen, Jacob Aarup Dalsgaard, Riccardo Fusaroli, Peter Juel Henrichsen, Rasmus Hvingelby, Andreas Kirkedal, Alex Speed Kjeldsen, and others. The Danish Gigaword corpus. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), 413–421. 2021.

2. Kenneth Enevoldsen, Lasse Hansen, and Kristoffer Nielbo. DaCy: a unified framework for Danish NLP. arXiv preprint arXiv:2107.05295, 2021.

3. Rasmus Hvingelby, Amalie Brogaard Pauli, Maria Barrett, Christina Rosted, Lasse Malm Lidegaard, and Anders Søgaard. DaNE: a named entity resource for Danish. In Proceedings of the 12th Language Resources and Evaluation Conference, 4597–4604. 2020.

4. Barbara Plank, Kristian Nørgaard Jensen, and Rob van der Goot. DaN+: Danish nested named entities and lexical normalization. arXiv preprint arXiv:2105.11301, 2021.

5. Erik F. Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. arXiv preprint cs/0306050, 2003.

6. Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, and others. OntoNotes release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA, 2013.