Getting started

DaCy is built on spaCy and uses the same pipeline structure. If you are familiar with spaCy, using DaCy should be easy, and you can combine other spaCy models and components with DaCy. If you are not familiar with spaCy, don’t worry: using DaCy is still easy.

Before we start, we assume you have installed DaCy and spaCy. If not, please check out the installation page.

To use DaCy, you first have to download a small, medium, or large model. To list all available models:

import dacy

for model in dacy.models():
    print(model)

da_dacy_small_trf-0.2.0
da_dacy_medium_trf-0.2.0
da_dacy_large_trf-0.2.0
small
medium
large
da_dacy_small_ner_fine_grained-0.1.0
da_dacy_medium_ner_fine_grained-0.1.0
da_dacy_large_ner_fine_grained-0.1.0

Note

The model name indicates the language (da), framework (dacy), model size (e.g. small), model type (trf), and model version (e.g. 0.1.0). A larger model generally gives higher accuracy, but also requires more memory and time to run.
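
The naming scheme can be made concrete with a small parser. This is only a sketch: parse_model_name and ModelName are hypothetical helpers, not part of DaCy, and the split assumes full names of the form listed above. Short aliases such as small carry no such information and are not handled.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    language: str
    framework: str
    size: str
    model_type: str
    version: str


def parse_model_name(name: str) -> ModelName:
    """Split a full model name into its parts (hypothetical helper)."""
    # Everything after the last "-" is the version, e.g. "0.2.0".
    stem, version = name.rsplit("-", 1)
    # The first three "_"-separated fields are fixed; the rest is the model type.
    language, framework, size, model_type = stem.split("_", 3)
    return ModelName(language, framework, size, model_type, version)


print(parse_model_name("da_dacy_small_trf-0.2.0"))
print(parse_model_name("da_dacy_large_ner_fine_grained-0.1.0").model_type)
```

Limiting the split to three underscores keeps multi-word model types such as ner_fine_grained in one piece.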

From here we can now download a model using:

# get the latest small model:
nlp = dacy.load("small")

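This will download and install the model. If the model has already been downloaded, it is simply loaded. Once loaded, DaCy works exactly like any other spaCy model.

We can now apply DaCy to text using conventional spaCy syntax: calling nlp on a text passes it through all the components of the pipeline and returns a Doc. The examples below assume the result is stored in a variable named doc, here obtained from the sentence “DaCy-pakken er en hurtig og effektiv pipeline til dansk sprogprocessering.”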

Named Entity Recognition

Named Entity Recognition (NER) is the task of identifying named entities in a text. A named entity is a “real-world object” that is assigned a name, for example a person, a country, a product, or a book title. DaCy can recognize organizations, persons, and locations, as well as other miscellaneous entities.

for entity in doc.ents:
    print(entity, ":", entity.label_)

DaCy-pakken : MISC
dansk : MISC
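
When a text contains many entities, it is often handier to collect them per label than to print them one by one. The sketch below assumes only the spaCy attributes doc.ents, ent.text, and ent.label_; the function name entities_by_label is hypothetical.

```python
from collections import defaultdict


def entities_by_label(doc):
    """Group entity texts by label; `doc` is any object with spaCy-style `.ents`."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    # Return a plain dict so that missing labels raise KeyError instead of
    # silently creating empty lists.
    return dict(grouped)
```

Going by the output printed above, calling entities_by_label(doc) on that document would map MISC to the two entities DaCy-pakken and dansk.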

We can also plot these using:

from spacy import displacy

displacy.render(doc, style="ent")

DaCy-pakken MISC NIL er en hurtig og effektiv pipeline til dansk MISC Q35 sprogprocessering.

While DaCy achieved state-of-the-art performance at the time of its release, it has since been outperformed by the NER model by Dan Nielsen. To give users access to the best model for their use case, DaCy lets you easily swap the NER component for a state-of-the-art model.

To do this you can simply load the model using:

# load the small dacy model excluding the NER component
nlp = dacy.load("small", exclude=["ner"])
# or use an empty spacy model if you only want to do NER
# nlp = spacy.blank("da")

# add the ner component from the state-of-the-art model
nlp.add_pipe("dacy/ner")

doc = nlp("Denne NER model er trænet af Dan fra Alexandra Instituttet")

displacy.render(doc, style="ent")

Denne NER MISC model er trænet af Dan PER fra Alexandra Instituttet ORG

Warning

Note that this will add an additional model to your pipeline, which will slow down inference.

Named Entity Linking

As you may have noticed, the named entities are annotated with a unique identifier. This is because DaCy also supports named entity linking.

Named entity linking is the task of linking a named entity to a knowledge base. This is done by assigning a unique identifier to each entity, which allows us to relate entities to each other and to extract information from the knowledge base. For example, we can link the entity “Barack Obama” to the Wikipedia or Wikidata page about Barack Obama. Named entity linking is also known as named entity disambiguation, though that term can also refer to the task of distinguishing between entities with the same name without linking to a knowledge base.

Beta feature

Named entity linking is currently in beta and is not yet fully tested. If you find any bugs, please report them on GitHub. We are working on expanding the knowledge base as well as correcting the annotations, which currently annotate unknown persons using the QIDs for the corresponding names. For instance, in the sentence Rutechef Ivan Madsen: “Jeg ved ikke hvorfor… the name Ivan Madsen is annotated using two QIDs, Q830350 (Ivan, male name) and Q16876242 (Madsen, family name), which we believe is incorrect, as the text does not refer to the family name Madsen but to the person with the full name Ivan Madsen. The knowledge base is also currently limited, so while the links you do obtain are often correct, the model will often be unable to link every entity to the knowledge base.

In DaCy, the small, medium, and large models have a named entity linking component. This component uses neural entity linking to match the entity to a specific entry in the knowledge base. The knowledge base DaCy uses is currently a combination of Danish and English Wikidata.

from wikidata.client import Client

nlp = dacy.load("small")
text = "Danmarks dronning bor i København"
doc = nlp(text)

displacy.render(doc, style="ent")


client = Client()  # start wikidata client
for entity in doc.ents:
    print(entity, ":", entity.kb_id_)

    # print the short description derived from wikidata
    wikidata_entry = client.get(entity.kb_id_, load=True)
    print(wikidata_entry.description.get("en"))
    print(wikidata_entry.description.get("da"))
    print(" ")

Danmarks LOC Q35 dronning bor i København LOC Q1748

Danmarks : Q35

country in Northern Europe
nordeuropæisk land
 
København : Q1748

capital and largest city of Denmark
Danmarks hovedstad og største by
 

You can go further, for example by extracting additional information from the knowledge base, such as images or the associated Wikipedia article.
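
A simple first step is turning each identifier into a link to its Wikidata page. The sketch below makes two assumptions: that linked entities carry a QID such as Q35 in ent.kb_id_, and that unlinked entities carry an empty string or a marker such as NIL, as in the rendered output earlier on this page. The helper names wikidata_url and linked_entities are hypothetical.

```python
def wikidata_url(kb_id):
    """Return the Wikidata page URL for a QID, or None for unlinked entities."""
    if not kb_id or not kb_id.startswith("Q") or not kb_id[1:].isdigit():
        return None  # e.g. "" or "NIL" when no knowledge-base entry was found
    return f"https://www.wikidata.org/wiki/{kb_id}"


def linked_entities(doc):
    """Map entity text to its Wikidata URL, skipping entities without a QID."""
    links = {}
    for ent in doc.ents:
        url = wikidata_url(ent.kb_id_)
        if url is not None:
            links[ent.text] = url
    return links
```

Validating the identifier before building the URL keeps placeholder values out of the result, so downstream code only ever sees resolvable links.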

Fine-grained NER

DaCy also features models with a more fine-grained Named Entity Recognition component, trained on the DANSK dataset. This allows for the detection of 18 classes, namely the following named entities:

Tag	Description
PERSON	People, including fictional
NORP	Nationalities or religious or political groups
FACILITY	Buildings, airports, highways, bridges, etc.
ORGANIZATION	Companies, agencies, institutions, etc.
GPE	Countries, cities, states.
LOCATION	Non-GPE locations, mountain ranges, bodies of water
PRODUCT	Vehicles, weapons, foods, etc. (not services)
EVENT	Named hurricanes, battles, wars, sports events, etc.
WORK OF ART	Titles of books, songs, etc.
LAW	Named documents made into laws
LANGUAGE	Any named language

As well as annotation for the following concepts:

Tag	Description
DATE	Absolute or relative dates or periods
TIME	Times smaller than a day
PERCENT	Percentage (including “%”)
MONEY	Monetary values, including unit
QUANTITY	Measurements, as of weight or distance
ORDINAL	“first”, “second”, etc.
CARDINAL	Numerals that do not fall under another type

The fine-grained NER component can be used in an existing pipeline as follows:

# load the small dacy model excluding the NER component
nlp = dacy.load("small", exclude=["ner"])

# add the ner component from the state-of-the-art fine-grained model
nlp.add_pipe("dacy/ner-fine-grained", config={"size": "small"})
# or if you only want to do just NER
# nlp = dacy.load("da_dacy_small_ner_fine_grained-0.1.0")

doc = nlp(
    "Denne model samt 3 andre blev trænet d. 7. marts af Center for Humantities Computing i Aarhus kommune"
)

displacy.render(doc, style="ent")

Denne model samt 3 CARDINAL andre blev trænet d. 7. marts DATE af Center for Humantities Computing ORGANIZATION i Aarhus kommune GPE
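
With 18 classes it is often useful to separate the named entities of the first table from the numeric and temporal concepts of the second. The sketch below assumes the label strings emitted by the model match the tags in the tables above; split_entities and NUMERIC_LABELS are hypothetical names.

```python
# Tags from the second table: dates, times, and numeric expressions.
NUMERIC_LABELS = {"DATE", "TIME", "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"}


def split_entities(doc):
    """Return (named, numeric) lists of (text, label) pairs from `doc.ents`."""
    named, numeric = [], []
    for ent in doc.ents:
        target = numeric if ent.label_ in NUMERIC_LABELS else named
        target.append((ent.text, ent.label_))
    return named, numeric
```

Both lists keep the order in which the entities occur in the text, so they can still be aligned with the original sentence.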

Part-of-speech Tagging

Part-of-speech tagging (POS) is the task of assigning a part of speech to each word in a text. The part of speech is the grammatical role of a word in a sentence. For example, the word “run” is a verb, and the word “book” is a noun.

After tokenization, DaCy can parse and tag a given Doc. This is where the trained pipeline and its statistical models come in, which enable spaCy to make predictions of which tag or label most likely applies in this context. A trained component includes data that is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following “the” in English is most likely a noun.

print("Token POS-tag")
for token in doc:
    print(f"{token}:\t {token.pos_}")

Token POS-tag
Denne:	 DET
model:	 NOUN
samt:	 CCONJ
3:	 NUM
andre:	 PRON
blev:	 AUX
trænet:	 VERB
d.:	 ADV
7.:	 ADJ
marts:	 NOUN
af:	 ADP
Center:	 NOUN
for:	 ADP
Humantities:	 PROPN
Computing:	 PROPN
i:	 ADP
Aarhus:	 PROPN
kommune:	 NOUN
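
The tags become more useful once aggregated. As a sketch that relies only on the spaCy token attributes text and pos_, the hypothetical helpers below count the tags in a document and keep only the content words.

```python
from collections import Counter


def pos_counts(doc):
    """Count coarse part-of-speech tags (`token.pos_`) in a processed Doc."""
    return Counter(token.pos_ for token in doc)


def content_words(doc, keep=("NOUN", "PROPN", "VERB", "ADJ")):
    """Return the text of tokens whose POS tag is in `keep`."""
    return [token.text for token in doc if token.pos_ in keep]
```

Going by the tags printed above, pos_counts(doc)["NOUN"] would be 4 for that sentence (model, marts, Center, kommune).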

Dependency Parsing

Dependency parsing is the task of assigning syntactic dependencies between tokens, i.e. identifying the head word of each token and the relation between the head and the token. For example, in the sentence “The quick brown fox jumps over the lazy dog”, the word “jumps” is the head of “fox”, which in turn heads the phrase “The quick brown fox”, and the relation between “fox” and “jumps” is “nsubj” (nominal subject).

DaCy features a fast and accurate syntactic dependency parser. In DaCy this dependency parsing is also used for sentence segmentation and detecting noun chunks.

You can render the dependency tree of a processed doc with displacy by calling displacy.render(doc, style="dep").
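
As a plain-text view of the parse, the following sketch lists each token with its relation and head. It assumes only the spaCy token attributes text, dep_, and head; the helper names are hypothetical.

```python
def dependency_triples(doc):
    """List (token, relation, head) triples; in spaCy the root is its own head."""
    return [(token.text, token.dep_, token.head.text) for token in doc]


def print_dependencies(doc):
    """Print one line per token, with an arrow from the token to its head."""
    for text, dep, head in dependency_triples(doc):
        print(f"{text:<12} --{dep}--> {head}")
```

The triples are also a convenient input for downstream rules, for example selecting every token whose relation is nsubj.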

Sentence Segmentation

Sentence segmentation is the task of splitting a text into sentences. In DaCy this is done using the dependency parser, which makes it accurate and allows it to detect sentences that are not separated by punctuation.

doc = nlp(
    "Sætnings segmentering er en vigtig del af sprogprocessering - Det kan bl.a. benyttes til at opdele lange tekster i mindre bidder uden at miste meningen i hvert sætning."
)

for sent in doc.sents:
    print(sent)

Sætnings segmentering er en vigtig del af sprogprocessering
- Det kan bl.a. benyttes til at opdele lange tekster i mindre bidder uden at miste meningen i hvert sætning.
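
The use case named in the example sentence, splitting long texts into smaller pieces without cutting through a sentence, can be sketched as follows. The helper chunk_sentences is hypothetical and works on plain strings, for instance the list [sent.text for sent in doc.sents]; the character limit is an arbitrary illustration.

```python
def chunk_sentences(sentences, max_chars=200):
    """Greedily pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` becomes its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip empty or whitespace-only sentences
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because chunks are only ever closed at sentence boundaries, each piece remains readable on its own.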

Noun Chunks

Noun chunks are “base noun phrases”: flat phrases that have a noun as their head. For example, “the big yellow taxi” and “the quick brown fox” are noun chunks. The head of a noun chunk is a noun-like word, such as a noun, a pronoun, or a proper noun.

Noun chunks are used, for example, for information extraction and for finding the subjects and objects of verbs.

doc = nlp("DaCy er en hurtig og effektiv pipeline til dansk sprogprocessering.")

for nc in doc.noun_chunks:
    print(nc)

DaCy
en hurtig og effektiv pipeline
dansk sprogprocessering
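
To find subjects and objects, each chunk can be paired with the relation of its root token and the word that governs it. This sketch assumes the spaCy attributes chunk.root, token.dep_, and token.head, and that the parser emits the nsubj relation for nominal subjects; chunk_roles and subjects are hypothetical helpers.

```python
def chunk_roles(doc):
    """Return (chunk, relation, governing word) for each noun chunk in `doc`."""
    return [
        (chunk.text, chunk.root.dep_, chunk.root.head.text)
        for chunk in doc.noun_chunks
    ]


def subjects(doc):
    """Noun chunks whose root token is a nominal subject (`nsubj`)."""
    return [text for text, dep, _ in chunk_roles(doc) if dep == "nsubj"]
```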

Lemmatization

Lemmatization is the task of grouping together the inflected forms of a word so they can be analysed as a single item. For example, the verb forms “runs”, “running”, and “ran” all share the base form (lemma) “run”.

Lemmatization is used, for example, for text normalization before training a machine learning model, to reduce the number of unique tokens in the training data.

doc = nlp("Normalisering af tekst kan være en god idé.")

for token in doc:
    print(token, token.lemma_)

Normalisering Normalisering
af af
tekst tekst
kan kunne
være være
en en
god god
idé idé
. .
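
The reduction in unique tokens can be measured directly. As a sketch that assumes only the spaCy token attributes text and lemma_, the hypothetical helper below compares the number of distinct surface forms with the number of distinct lemmas across a collection of processed documents.

```python
def vocabulary_sizes(docs):
    """Return (unique lower-cased token forms, unique lower-cased lemmas)."""
    forms, lemmas = set(), set()
    for doc in docs:
        for token in doc:
            forms.add(token.text.lower())
            lemmas.add(token.lemma_.lower())
    return len(forms), len(lemmas)
```

The gap between the two numbers grows with the amount of inflection in the text, since several forms collapse onto the same lemma.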

Coreference Resolution

Coreference resolution is the task of finding all expressions that refer to the same entity in a text. For example, in the sentence “The dog chased the ball because it was shiny”, “it” is referring to the “ball”.

Coreference resolution is used, for example, in question answering, summarization, conversational agents (chatbots), and information extraction, where resolved references can lead to a better semantic representation.

Beta feature

Coreference resolution is currently an experimental feature from spaCy. This is thus only a beta feature in DaCy. We are currently working on improving the performance of the model.

text = "Den 4. november 2020 fik minkavler Henning Christensen og hele familien et chok. Efter et pressemøde, fik han at vide at alle mink i Danmark skulle aflives. Dermed fik han fjernet hans livsgrundlag"
doc = nlp(text)
print("Coreference clusters:")
print(doc.spans)

Coreference clusters:
{'coref_clusters_1': [minkavler Henning Christensen, han, han, hans]}
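
To work with the clusters programmatically rather than printing them, the span groups can be converted to plain strings. This sketch assumes the clusters are stored in doc.spans under keys starting with coref_clusters, as in the output above; coref_mentions is a hypothetical helper.

```python
def coref_mentions(doc, prefix="coref_clusters"):
    """Map each coreference cluster key in `doc.spans` to its mention texts."""
    return {
        key: [span.text for span in spans]
        for key, spans in doc.spans.items()
        if key.startswith(prefix)  # ignore span groups from other components
    }
```

Filtering on the key prefix matters because doc.spans can also hold span groups written by other pipeline components.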