Datasets
dacy.datasets.dane
This module provides the DaNE dataset, wrapped and read in as a spaCy corpus.
- dacy.datasets.dane.dane(save_path=None, splits=['train', 'dev', 'test'], redownload=False, n_sents=1, open_unverified_connection=False, **kwargs)
Reads the DaNE dataset as a spaCy Corpus.
- Parameters
save_path (str, optional) – Path to the DaNE dataset. If the folder does not contain the dataset, it is downloaded there. Defaults to None, which corresponds to the datasets subfolder of the path returned by dacy.where_is_my_dacy().
splits (List[str], optional) – Which splits of the dataset should be returned. Possible options include “train”, “dev”, “test”, “all”. Defaults to [“train”, “dev”, “test”].
redownload (bool, optional) – Whether to redownload the dataset. Defaults to False.
n_sents (int, optional) – Number of sentences per document. Only applied if the dataset is downloaded. Defaults to 1.
open_unverified_connection (bool, optional) – Whether to allow downloading over an unverified connection. Defaults to False.
force_extension (bool, optional) – Set the extension to the doc regardless of whether it already exists. Defaults to False.
- Returns
A spaCy Corpus, or a list of them.
- Return type
Union[List[Corpus], Corpus]
Example
>>> from dacy.datasets import dane
>>> train, dev, test = dane(splits=["train", "dev", "test"])
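Because the return type is either a single Corpus or a list of them, it can help to normalize the result before use. The following is a minimal sketch and not part of DaCy: `splits_to_dict` and `load_dane_splits` are hypothetical helpers, and the sketch assumes that a returned list follows the order of the requested splits, as the example above suggests.

```python
from typing import Any, Dict, List, Sequence


def splits_to_dict(result: Any, splits: Sequence[str]) -> Dict[str, Any]:
    """Pair each requested split name with the corpus returned for it.

    Hypothetical helper (not part of DaCy). Assumes a returned list is
    ordered like `splits`.
    """
    # A single corpus is wrapped in a list so both cases are handled alike.
    corpora: List[Any] = result if isinstance(result, list) else [result]
    if len(corpora) != len(splits):
        raise ValueError(f"expected {len(splits)} corpora, got {len(corpora)}")
    return dict(zip(splits, corpora))


def load_dane_splits(
    splits: Sequence[str] = ("train", "dev", "test"),
) -> Dict[str, Any]:
    """Load DaNE and return a mapping from split name to corpus."""
    # Imported lazily so splits_to_dict can be used without DaCy installed.
    from dacy.datasets import dane

    return splits_to_dict(dane(splits=list(splits)), splits)
```

With this in place, `load_dane_splits(["train", "dev"])["dev"]` would give the dev corpus by name instead of by position.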
dacy.datasets.names
Helper functions for loading name dictionaries for person augmentation.
- dacy.datasets.names.danish_names()
Returns a dictionary of Danish names.
- Returns
A dictionary of Danish names containing the keys “first_name” and “last_name”. The list is derived from Danmarks statistik (2021).
- Return type
Dict[str, List[str]]
Example
>>> from dacy.datasets import danish_names
>>> names = danish_names()
>>> names["first_name"]
>>> names["last_name"]
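Since the name dictionaries are intended for person augmentation, one way to use a `Dict[str, List[str]]` of this shape is to draw a random first and last name from it. The sketch below is illustrative only: `toy_names` is a made-up stand-in for the real dictionary (its entries are placeholders, not taken from the dataset), and `sample_full_name` is a hypothetical helper rather than a DaCy function.

```python
import random
from typing import Dict, List

# Made-up placeholder data with the same keys as the dictionaries above.
toy_names: Dict[str, List[str]] = {
    "first_name": ["Anne", "Peter", "Mette"],
    "last_name": ["Jensen", "Nielsen", "Hansen"],
}


def sample_full_name(names: Dict[str, List[str]], rng: random.Random) -> str:
    """Draw one first name and one last name uniformly at random."""
    first = rng.choice(names["first_name"])
    last = rng.choice(names["last_name"])
    return f"{first} {last}"


# A seeded generator keeps the sampled names reproducible between runs.
rng = random.Random(0)
sampled = [sample_full_name(toy_names, rng) for _ in range(3)]
```

Passing the result of `danish_names()` in place of `toy_names` should work the same way, given the documented keys.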
- dacy.datasets.names.female_names()
Returns a dictionary of Danish female names.
- Returns
A dictionary of names containing the keys “first_name” and “last_name”. The list is derived from Danmarks statistik (2021).
- Return type
Dict[str, List[str]]
Example
>>> from dacy.datasets import female_names
>>> names = female_names()
>>> names["first_name"]
>>> names["last_name"]
- dacy.datasets.names.load_names(min_count=0, ethnicity=None, gender=None, min_prop_gender=0)
Loads the names lookup table. Danish names are from Danmarks statistik (2021). Muslim names are from Meldgaard (2005), https://nors.ku.dk/publikationer/webpublikationer/muslimske_fornavne/.
- Parameters
min_count (int, optional) – Minimum number of occurrences of the name for it to be included. Defaults to 0.
ethnicity (Optional[str], optional) – Which ethnicity to include. None includes all. Options include “muslim”, “danish”. Defaults to None.
gender (Optional[str], optional) – Which gender to include. None includes all. Options include “male”, “female”. Defaults to None.
min_prop_gender (float, optional) – Minimum probability that a name belongs to the given gender. The probability is based on the proportion of people with that name who are of that gender. Only used when gender is set. Defaults to 0.
- Returns
A dictionary of names containing the keys “first_name” and “last_name”.
- Return type
Dict[str, List[str]]
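To make the interaction between the four parameters concrete, the following sketch applies the documented filters to a toy table. It is an illustration under stated assumptions and not DaCy's implementation: `NameRow`, its `prop_male` column, the example rows, and `filter_first_names` are all hypothetical, and how the library itself treats the default threshold of 0 when gender is set is not specified here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NameRow:
    """One row of a hypothetical lookup table (not DaCy's actual schema)."""

    name: str
    count: int  # number of people with this name
    ethnicity: str  # "danish" or "muslim"
    prop_male: float  # share of people with this name who are male


def filter_first_names(
    rows: List[NameRow],
    min_count: int = 0,
    ethnicity: Optional[str] = None,
    gender: Optional[str] = None,
    min_prop_gender: float = 0.0,
) -> List[str]:
    """Keep the names that pass every requested filter."""
    kept = []
    for row in rows:
        if row.count < min_count:
            continue
        if ethnicity is not None and row.ethnicity != ethnicity:
            continue
        if gender is not None:
            # Proportion of people with this name who have the requested gender.
            prop = row.prop_male if gender == "male" else 1.0 - row.prop_male
            if prop < min_prop_gender:
                continue
        kept.append(row.name)
    return kept


# Made-up rows for illustration only.
rows = [
    NameRow("Alex", 100, "danish", 0.8),
    NameRow("Kim", 50, "danish", 0.6),
    NameRow("Maria", 200, "danish", 0.0),
    NameRow("Ali", 5, "muslim", 1.0),
]
mostly_male = filter_first_names(rows, gender="male", min_prop_gender=0.7)
```

In this toy table, raising `min_prop_gender` excludes names such as "Kim" that are shared across genders, which is the behaviour the parameter description suggests.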
- dacy.datasets.names.male_names()
Returns a dictionary of Danish male names.
- Returns
A dictionary of names containing the keys “first_name” and “last_name”. The list is derived from Danmarks statistik (2021).
- Return type
Dict[str, List[str]]
Example
>>> from dacy.datasets import male_names
>>> names = male_names()
>>> names["first_name"]
>>> names["last_name"]
- dacy.datasets.names.muslim_names()
Returns a dictionary of Muslim names.
- Returns
A dictionary of Muslim names containing the keys “first_name” and “last_name”. The list is derived from Meldgaard (2005), https://nors.ku.dk/publikationer/webpublikationer/muslimske_fornavne/.
- Return type
Dict[str, List[str]]
Example
>>> from dacy.datasets import muslim_names
>>> names = muslim_names()
>>> names["first_name"]
>>> names["last_name"]