Datasets

dacy.datasets.dane

This module provides the DaNE dataset, wrapped and read in as a spaCy Corpus.

dacy.datasets.dane.dane(save_path=None, splits=['train', 'dev', 'test'], redownload=False, n_sents=1, open_unverified_connection=False, **kwargs)

Reads the DaNE dataset as a spaCy Corpus.

Parameters
  • save_path (str, optional) – Path to the DaNE dataset. If the folder does not contain the dataset, it is downloaded there. Defaults to None, which corresponds to the datasets subfolder of dacy.where_is_my_dacy().

  • splits (List[str], optional) – Which splits of the dataset should be returned. Possible options include “train”, “dev”, “test”, “all”. Defaults to [“train”, “dev”, “test”].

  • redownload (bool, optional) – Whether to redownload the dataset. Defaults to False.

  • n_sents (int, optional) – Number of sentences per document. Only applied if the dataset is downloaded. Defaults to 1.

  • open_unverified_connection (bool, optional) – Whether to download over an unverified connection. Defaults to False.

  • force_extension (bool, optional) – Set the extension to the doc regardless of whether it already exists. Defaults to False.

Returns

A spaCy Corpus or a list thereof.

Return type

Union[List[Corpus], Corpus]
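
Because the return value is either a single Corpus or a list, calling code may want one uniform shape. The sketch below pairs each requested split with its corpus. The helper name corpora_by_split is hypothetical and not part of dacy, and it assumes that a list result follows the order of splits. Plain strings stand in for Corpus objects so the snippet runs without downloading anything.

```python
from typing import Any, Dict, Sequence


def corpora_by_split(result: Any, splits: Sequence[str]) -> Dict[str, Any]:
    """Pair each requested split name with its corpus.

    Hypothetical helper (not part of dacy). Assumes a list result follows
    the order of `splits`; a bare object is treated as a single corpus.
    """
    corpora = list(result) if isinstance(result, (list, tuple)) else [result]
    if len(corpora) != len(splits):
        raise ValueError(
            f"expected {len(splits)} corpora, got {len(corpora)}"
        )
    return dict(zip(splits, corpora))


# Stand-in values; in practice `result` would come from dane(splits=...).
by_split = corpora_by_split(["train-corpus", "dev-corpus"], ["train", "dev"])
print(by_split["dev"])
```

With a real call, the first argument would be the value returned by dane and the second the same splits list that was passed to it.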

Example

>>> from dacy.datasets import dane
>>> train, dev, test = dane(splits=["train", "dev", "test"])
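
The n_sents parameter sets how many sentences make up one document. As a rough illustration of that grouping, the sketch below chunks a list of sentences into documents of n_sents sentences each. group_sentences is a hypothetical helper, not the conversion code used by dacy or spaCy, and keeping a shorter final group is an assumption.

```python
from typing import List, Sequence


def group_sentences(sentences: Sequence[str], n_sents: int = 1) -> List[List[str]]:
    """Group consecutive sentences into documents of `n_sents` sentences.

    Hypothetical illustration only; the final group may be shorter
    (an assumption about how leftovers are handled).
    """
    if n_sents < 1:
        raise ValueError("n_sents must be at least 1")
    return [
        list(sentences[start:start + n_sents])
        for start in range(0, len(sentences), n_sents)
    ]


sentences = ["s1", "s2", "s3", "s4", "s5"]
print(group_sentences(sentences, n_sents=2))
```

With n_sents=1, the default, every sentence becomes its own document.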

dacy.datasets.names

Helper functions for loading name dictionaries for person augmentation.

dacy.datasets.names.danish_names()

Returns a dictionary of Danish names.

Returns

A dictionary of Danish names containing the keys “first_name” and “last_name”. The list is derived from Danmarks Statistik (2021).

Return type

Dict[str, List[str]]

Example

>>> from dacy.datasets import danish_names
>>> names = danish_names()
>>> names["first_name"]
>>> names["last_name"]

dacy.datasets.names.female_names()

Returns a dictionary of Danish female names.

Returns

A dictionary of names containing the keys “first_name” and “last_name”. The list is derived from Danmarks Statistik (2021).

Return type

Dict[str, List[str]]

Example

>>> from dacy.datasets import female_names
>>> names = female_names()
>>> names["first_name"]
>>> names["last_name"]

dacy.datasets.names.load_names(min_count=0, ethnicity=None, gender=None, min_prop_gender=0)

Loads the names lookup table. Danish names are from Danmarks Statistik (2021). Muslim names are from Meldgaard (2005), https://nors.ku.dk/publikationer/webpublikationer/muslimske_fornavne/.

Parameters
  • min_count (int, optional) – Minimum number of occurrences of the name for it to be included. Defaults to 0.

  • ethnicity (Optional[str], optional) – Which ethnicity should be included. None indicates that all are included. Options include “muslim”, “danish”. Defaults to None.

  • gender (Optional[str], optional) – Which gender should be included. None indicates that all are included. Options include “male”, “female”. Defaults to None.

  • min_prop_gender (float) – Minimum probability of a name being a given gender, estimated as the proportion of people with that name who are of that gender. Only used when gender is set. Defaults to 0.

Returns

A dictionary of names containing the keys “first_name” and “last_name”.

Return type

Dict[str, List[str]]
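
To make the min_count and min_prop_gender thresholds concrete, here is a minimal sketch of one plausible reading of them, applied to a toy table. NameRow, its fields, filter_first_names, and every number in the table are invented for illustration; this is not dacy's lookup table or its implementation. The sketch covers only the two thresholds, not how the table is otherwise narrowed by gender or ethnicity.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class NameRow:
    """One row of a hypothetical name table (not dacy's actual schema)."""
    name: str
    count: int           # made-up number of people carrying the name
    prop_female: float   # made-up share of those people who are female


def filter_first_names(
    rows: Sequence[NameRow],
    min_count: int = 0,
    gender: Optional[str] = None,
    min_prop_gender: float = 0.0,
) -> List[str]:
    """Keep names that meet the count and gender-proportion thresholds."""
    if gender not in (None, "male", "female"):
        raise ValueError("gender must be None, 'male' or 'female'")
    kept = []
    for row in rows:
        if row.count < min_count:
            continue  # name is too rare
        if gender is not None:
            # Proportion of people with this name who have the requested gender.
            prop = row.prop_female if gender == "female" else 1.0 - row.prop_female
            if prop < min_prop_gender:
                continue  # name is not typical enough for that gender
        kept.append(row.name)
    return kept


toy_rows = [
    NameRow("Anne", 45000, 1.0),
    NameRow("Kim", 12000, 0.1),
    NameRow("Robin", 900, 0.4),
]
print(filter_first_names(toy_rows, min_count=1000))
print(filter_first_names(toy_rows, gender="female", min_prop_gender=0.3))
```

As in the parameter description, the proportion threshold has no effect in this sketch unless gender is set.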

dacy.datasets.names.male_names()

Returns a dictionary of Danish male names.

Returns

A dictionary of names containing the keys “first_name” and “last_name”. The list is derived from Danmarks Statistik (2021).

Return type

Dict[str, List[str]]

Example

>>> from dacy.datasets import male_names
>>> names = male_names()
>>> names["first_name"]
>>> names["last_name"]

dacy.datasets.names.muslim_names()

Returns a dictionary of Muslim names.

Returns

A dictionary of Muslim names containing the keys “first_name” and “last_name”. The list is derived from Meldgaard (2005), https://nors.ku.dk/publikationer/webpublikationer/muslimske_fornavne/.

Return type

Dict[str, List[str]]

Example

>>> from dacy.datasets import muslim_names
>>> names = muslim_names()
>>> names["first_name"]
>>> names["last_name"]
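
The name dictionaries in this module share the keys “first_name” and “last_name”, so the same code can draw a replacement name from any of them. The sketch below samples a full name from such a dictionary. sample_full_name is a hypothetical helper, not part of dacy, and the toy dictionary stands in for the output of a function such as danish_names().

```python
import random
from typing import Dict, List, Optional


def sample_full_name(
    names: Dict[str, List[str]],
    rng: Optional[random.Random] = None,
) -> str:
    """Draw one first name and one last name at random and join them.

    Hypothetical helper; `names` must contain non-empty lists under
    the keys "first_name" and "last_name".
    """
    if rng is None:
        rng = random.Random()
    first = rng.choice(names["first_name"])
    last = rng.choice(names["last_name"])
    return f"{first} {last}"


# Toy stand-in for a dictionary returned by one of the functions above.
toy_names = {
    "first_name": ["Anne", "Lars"],
    "last_name": ["Jensen", "Nielsen"],
}
print(sample_full_name(toy_names, random.Random(0)))
```

Passing a seeded random.Random instance keeps the sampling reproducible across runs.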