Getting started


DaCy is built on spaCy and uses the same pipeline structure, so if you are familiar with spaCy, using DaCy should be easy. It also allows you to use other spaCy models and components together with DaCy. Don't worry if you are not familiar with spaCy; DaCy is still easy to use.

Before we start, we assume you have installed DaCy and spaCy. If not, please check out the installation page.

To use DaCy, you first have to download one of the models: small, medium, or large. To see a list of all available models:

import dacy

for model in dacy.models():
    print(model)
da_dacy_small_trf-0.2.0
da_dacy_medium_trf-0.2.0
da_dacy_large_trf-0.2.0
small
medium
large
da_dacy_small_ner_fine_grained-0.1.0
da_dacy_medium_ner_fine_grained-0.1.0
da_dacy_large_ner_fine_grained-0.1.0

Note

The model name indicates the language (da), framework (dacy), model size (e.g. small), model type (trf, i.e. transformer), and model version (e.g. 0.2.0). Using a larger model size will increase the accuracy of the model, but also the memory and time needed to run it.

From here, we can download a model using:

# get the latest small model:
nlp = dacy.load("small")
Defaulting to user installation because normal site-packages is not writeable
Collecting da_dacy_small_trf@ https://huggingface.co/chcaa/da_dacy_small_trf/resolve/0eadea074d5f637e76357c46bbd56451471d0154/da_dacy_small_trf-any-py3-none-any.whl
  Downloading https://huggingface.co/chcaa/da_dacy_small_trf/resolve/0eadea074d5f637e76357c46bbd56451471d0154/da_dacy_small_trf-any-py3-none-any.whl (101.3 MB)
Installing collected packages: da_dacy_small_trf
Successfully installed da_dacy_small_trf-0.2.0

This will download and install the model. If the model is already downloaded, it will simply be loaded. Once loaded, DaCy works exactly like any other spaCy model.

We can now apply DaCy to text using conventional spaCy syntax, passing the text through all the components of the nlp pipeline.

See also

DaCy is built using spaCy, so you will find much of the documentation you need for using the pipeline in spaCy's very well-written documentation.

doc = nlp("DaCy-pakken er en hurtig og effektiv pipeline til dansk sprogprocessering.")

Named Entity Recognition

Named Entity Recognition (NER) is the task of identifying named entities in a text. A named entity is a "real-world object" that's assigned a name - for example, a person, a country, a product or a book title. DaCy can recognize organizations, persons, and locations, as well as other miscellaneous entities.

for entity in doc.ents:
    print(entity, ":", entity.label_)
DaCy-pakken : MISC
dansk : MISC

We can also visualize the entities using:

from spacy import displacy

displacy.render(doc, style="ent")
DaCy-pakken [MISC, NIL] er en hurtig og effektiv pipeline til dansk [MISC, Q35] sprogprocessering.

While DaCy achieved state-of-the-art performance at the time of its release, it has since been outperformed by the NER model by Dan Nielsen. To give users access to the best model for their use case, DaCy allows you to easily switch the NER component to obtain a state-of-the-art model.

To do this, you can simply load the model using:

# load the small dacy model excluding the NER component
nlp = dacy.load("small", exclude=["ner"])
# or use an empty spacy model if you only want to do NER
# nlp = spacy.blank("da")

# add the ner component from the state-of-the-art model
nlp.add_pipe("dacy/ner")
Defaulting to user installation because normal site-packages is not writeable
Collecting da_dacy_small_trf@ https://huggingface.co/chcaa/da_dacy_small_trf/resolve/0eadea074d5f637e76357c46bbd56451471d0154/da_dacy_small_trf-any-py3-none-any.whl
  Downloading https://huggingface.co/chcaa/da_dacy_small_trf/resolve/0eadea074d5f637e76357c46bbd56451471d0154/da_dacy_small_trf-any-py3-none-any.whl (101.3 MB)
<spacy_wrap.pipeline_component_tok_clf.TokenClassificationTransformer at 0x7f82b83e4400>
doc = nlp("Denne NER model er trænet af Dan fra Alexandra Instituttet")

displacy.render(doc, style="ent")
Denne NER [MISC] model er trænet af Dan [PER] fra Alexandra Instituttet [ORG]

Warning

Note that this will add an additional model to your pipeline, which will slow down inference.

Named Entity Linking

As you may have noticed, the named entities above are annotated with a unique identifier. This is because DaCy also supports named entity linking.

Named entity linking is the task of linking a named entity to a knowledge base. This is done by assigning a unique identifier to each entity, which allows us to link entities to other entities and extract information from the knowledge base. For example, we can link the entity "Barack Obama" to the Wikipedia or Wikidata page about Barack Obama. Named entity linking is also known as named entity disambiguation, though that term can also refer to distinguishing between entities with the same name without linking to a knowledge base.

Beta feature

Named entity linking is currently in beta and not yet fully tested. If you find any bugs, please report them on GitHub. We are working on expanding the knowledge base as well as correcting the annotations, which currently annotate unknown persons using the QID for the corresponding name. For instance, in the sentence Rutechef Ivan Madsen: "Jeg ved ikke hvorfor… the name Ivan Madsen is annotated using two QIDs, Q830350 (Ivan, male name) and Q16876242 (Madsen, family name). We believe this is incorrect, as the text refers not to the last name Madsen but to the person with the full name Ivan Madsen. The knowledge base is also currently limited, so while the links you do obtain are often correct, the model will often not be able to link all entities to the knowledge base.

In DaCy, the small, medium, and large models all have a named entity linking component. This component uses neural entity linking to match an entity to a specific entry in the knowledge base. The knowledge base DaCy uses is currently a combination of Danish and English Wikidata.

from wikidata.client import Client

nlp = dacy.load("small")
text = "Danmarks dronning bor i København"
doc = nlp(text)
Defaulting to user installation because normal site-packages is not writeable
Collecting da_dacy_small_trf@ https://huggingface.co/chcaa/da_dacy_small_trf/resolve/0eadea074d5f637e76357c46bbd56451471d0154/da_dacy_small_trf-any-py3-none-any.whl
  Downloading https://huggingface.co/chcaa/da_dacy_small_trf/resolve/0eadea074d5f637e76357c46bbd56451471d0154/da_dacy_small_trf-any-py3-none-any.whl (101.3 MB)
displacy.render(doc, style="ent")


client = Client()  # start wikidata client
for entity in doc.ents:
    print(entity, ":", entity.kb_id_)

    # print the short description derived from wikidata
    wikidata_entry = client.get(entity.kb_id_, load=True)
    print(wikidata_entry.description.get("en"))
    print(wikidata_entry.description.get("da"))
    print(" ")
Danmarks [LOC, Q35] dronning bor i København [LOC, Q1748]
Danmarks : Q35
country in Northern Europe
nordeuropæisk land
 
København : Q1748
capital and largest city of Denmark
Danmarks hovedstad og største by
 

You can extract even more information from the knowledge base, such as images or the associated Wikipedia article, as sketched below.
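For example, using the same wikidata client as above, you can fetch the linked Wikipedia article and an image from the entity's Wikidata entry. A minimal sketch (Q1748, København, is the QID from the example above; P18 is Wikidata's "image" property; the sitelinks lookup assumes the raw entity JSON is exposed via entity.data):

from wikidata.client import Client

client = Client()
entity = client.get("Q1748", load=True)  # Q1748 = København, from the example above

# the raw entity JSON contains sitelinks to the Wikipedia editions
sitelink = entity.data["sitelinks"].get("dawiki")
if sitelink is not None:
    print("Danish Wikipedia article:", sitelink["title"])

# P18 is the Wikidata property holding the entity's image
image_prop = client.get("P18")
try:
    print("Image URL:", entity[image_prop].image_url)
except KeyError:
    print("No image available for this entity")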

Fine-grained NER

DaCy also features models with a more fine-grained named entity recognition component, trained on the DANSK dataset. This allows for the detection of 18 classes, namely the following named entities:

Tag            Description
PERSON         People, including fictional
NORP           Nationalities or religious or political groups
FACILITY       Buildings, airports, highways, bridges, etc.
ORGANIZATION   Companies, agencies, institutions, etc.
GPE            Countries, cities, states
LOCATION       Non-GPE locations, mountain ranges, bodies of water
PRODUCT        Vehicles, weapons, foods, etc. (not services)
EVENT          Named hurricanes, battles, wars, sports events, etc.
WORK OF ART    Titles of books, songs, etc.
LAW            Named documents made into laws
LANGUAGE       Any named language

As well as annotations for the following concepts:

Tag        Description
DATE       Absolute or relative dates or periods
TIME       Times smaller than a day
PERCENT    Percentage, including "%"
MONEY      Monetary values, including unit
QUANTITY   Measurements, as of weight or distance
ORDINAL    "first", "second", etc.
CARDINAL   Numerals that do not fall under another type

The fine-grained NER component can be added to an existing pipeline as follows:

# load the small dacy model excluding the NER component
nlp = dacy.load("small", exclude=["ner"])

# add the ner component from the state-of-the-art fine-grained model
nlp.add_pipe("dacy/ner-fine-grained", config={"size": "small"})
# or if you only want to do just NER
# nlp = dacy.load("da_dacy_small_ner_fine_grained-0.1.0")
Defaulting to user installation because normal site-packages is not writeable
Collecting da_dacy_small_trf@ https://huggingface.co/chcaa/da_dacy_small_trf/resolve/0eadea074d5f637e76357c46bbd56451471d0154/da_dacy_small_trf-any-py3-none-any.whl
  Downloading https://huggingface.co/chcaa/da_dacy_small_trf/resolve/0eadea074d5f637e76357c46bbd56451471d0154/da_dacy_small_trf-any-py3-none-any.whl (101.3 MB)
Defaulting to user installation because normal site-packages is not writeable
Collecting da_dacy_small_ner_fine_grained@ https://huggingface.co/chcaa/da_dacy_small_ner_fine_grained/resolve/43fedc5a1b1c1d193f461d13225f217f2ced507d/da_dacy_small_ner_fine_grained-any-py3-none-any.whl
  Downloading https://huggingface.co/chcaa/da_dacy_small_ner_fine_grained/resolve/43fedc5a1b1c1d193f461d13225f217f2ced507d/da_dacy_small_ner_fine_grained-any-py3-none-any.whl (82.7 MB)
Installing collected packages: da_dacy_small_ner_fine_grained
Successfully installed da_dacy_small_ner_fine_grained-0.1.0
<spacy.pipeline.ner.EntityRecognizer at 0x7f82c8615000>
doc = nlp(
    "Denne model samt 3 andre blev trænet d. 7. marts af Center for Humanities Computing i Aarhus kommune"
)

displacy.render(doc, style="ent")
Denne model samt 3 [CARDINAL] andre blev trænet d. 7. marts [DATE] af Center for Humanities Computing [ORGANIZATION] i Aarhus kommune [GPE]

Part-of-speech Tagging

Part-of-speech tagging (POS) is the task of assigning a part of speech to each word in a text. The part of speech is the grammatical role of a word in a sentence. For example, the word “run” is a verb, and the word “book” is a noun.

After tokenization, DaCy can parse and tag a given Doc. This is where the trained pipeline and its statistical models come in, which enable spaCy to make predictions of which tag or label most likely applies in this context. A trained component includes data that is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following “the” in English is most likely a noun.

print("Token POS-tag")
for token in doc:
    print(f"{token}:\t {token.pos_}")
Token POS-tag
Denne:	 DET
model:	 NOUN
samt:	 CCONJ
3:	 NUM
andre:	 PRON
blev:	 AUX
trænet:	 VERB
d.:	 ADV
7.:	 ADJ
marts:	 NOUN
af:	 ADP
Center:	 NOUN
for:	 ADP
Humanities:	 PROPN
Computing:	 PROPN
i:	 ADP
Aarhus:	 PROPN
kommune:	 NOUN

See also

For more on part-of-speech tagging, see spaCy's documentation.

Dependency Parsing

Dependency parsing is the task of assigning syntactic dependencies between tokens, i.e. identifying each word's head and the relation between them. For example, in the sentence "The quick brown fox jumps over the lazy dog", the noun "fox" heads the phrase "the quick brown fox" and attaches to the verb "jumps" with the relation "nsubj" (nominal subject).

DaCy features a fast and accurate syntactic dependency parser. The dependency parse is also used in DaCy for sentence segmentation and for detecting noun chunks.

You can see the dependency tree using:

See also

For more on dependency parsing, see spaCy's documentation.

doc = nlp("DaCy er en effektiv pipeline til dansk fritekst.")
from spacy import displacy

displacy.render(doc)
[displaCy rendering of the dependency tree: DaCy/PROPN er/AUX en/DET effektiv/ADJ pipeline/NOUN til/ADP dansk/ADJ fritekst/NOUN, with the arcs nsubj, cop, det, amod, case, amod and nmod]
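Besides the visualization, the parse can be inspected programmatically: every token carries its dependency label in token.dep_ and a reference to its syntactic head in token.head.

# print each token with its dependency relation and head
for token in doc:
    print(f"{token.text:<10} {token.dep_:<8} {token.head.text}")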

Sentence Segmentation

Sentence segmentation is the task of splitting a text into sentences. In DaCy this is done using the dependency parser, which makes it very accurate and allows it to detect sentences that are not separated by punctuation.

doc = nlp(
    "Sætnings segmentering er en vigtig del af sprogprocessering - Det kan bl.a. benyttes til at opdele lange tekster i mindre bidder uden at miste meningen i hvert sætning."
)

for sent in doc.sents:
    print(sent)
Sætnings segmentering er en vigtig del af sprogprocessering
- Det kan bl.a. benyttes til at opdele lange tekster i mindre bidder uden at miste meningen i hvert sætning.

Noun Chunks

Noun chunks are "base noun phrases": flat phrases that have a noun as their head. For example, "the big yellow taxi" and "the quick brown fox" are noun chunks. You can think of a noun chunk as a noun together with the words describing it, such as determiners and adjectives.

Noun chunks are for example used for information extraction, and for finding the subjects and objects of verbs (see the sketch after the example below).

doc = nlp("DaCy er en hurtig og effektiv pipeline til dansk sprogprocessering.")

for nc in doc.noun_chunks:
    print(nc)
DaCy
en hurtig og effektiv pipeline
dansk sprogprocessering
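As a small sketch of the latter use case, the root token of each chunk carries a dependency label, so subjects and objects can be filtered out (in the Universal Dependencies scheme used here, subjects are labelled nsubj and objects obj):

# keep only the chunks whose root acts as a subject or object
for nc in doc.noun_chunks:
    if nc.root.dep_ in ("nsubj", "obj"):
        print(nc.text, "->", nc.root.dep_)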

Lemmatization

Lemmatization is the task of grouping together the inflected forms of a word so they can be analysed as a single item. For example, the forms "runs", "running", and "ran" all share the lemma "run".

Lemmatization is for example used to normalize text before training a machine learning model, which reduces the number of unique tokens in the training data.

doc = nlp("Normalisering af tekst kan være en god idé.")

for token in doc:
    print(token, token.lemma_)
Normalisering Normalisering
af af
tekst tekst
kan kunne
være være
en en
god god
idé idé
. .
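Building on this, a normalizer can be as simple as joining the lowercased lemmas and dropping punctuation. A minimal sketch (the normalize helper is illustrative, not part of DaCy):

def normalize(text: str) -> str:
    # lowercase each lemma and drop punctuation tokens
    doc = nlp(text)
    return " ".join(token.lemma_.lower() for token in doc if not token.is_punct)

print(normalize("Normalisering af tekst kan være en god idé."))
# -> normalisering af tekst kunne være en god idé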

Coreference Resolution

Coreference resolution is the task of finding all expressions that refer to the same entity in a text. For example, in the sentence "The dog chased the ball because it was shiny", "it" refers to "the ball".

Coreference resolution is for example used in question answering, summarization, conversational agents/chatbots, and information extraction, where resolved references can lead to a better semantic representation.

Beta feature

Coreference resolution is currently an experimental feature in spaCy, and is therefore only a beta feature in DaCy. We are working on improving the performance of the model.

text = "Den 4. november 2020 fik minkavler Henning Christensen og hele familien et chok. Efter et pressemøde, fik han at vide at alle mink i Danmark skulle aflives. Dermed fik han fjernet hans livsgrundlag"
doc = nlp(text)
print("Coreference clusters:")
print(doc.spans)
Coreference clusters:
{'coref_clusters_1': [minkavler Henning Christensen, han, han, hans]}
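A common way to use these clusters is to substitute each later mention with the first, typically most descriptive, mention in its cluster. A minimal sketch, assuming the cluster keys follow the coref_clusters_* naming shown above:

# map every later mention in a cluster back to its first mention
for key, cluster in doc.spans.items():
    if not key.startswith("coref_clusters"):
        continue
    mentions = list(cluster)
    head_mention = mentions[0]
    for mention in mentions[1:]:
        print(f'"{mention.text}" refers to "{head_mention.text}"')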