Evaluating Robustness#
This tutorial walks through how to use Augmenty/spaCy augmenters to evaluate the robustness of any NLP pipeline. As an example, we start out by evaluating spaCy small and DaCy small on the test set of DaNE, the Danish Dependency Treebank annotated with part-of-speech tags, dependency relations, and named entities. Lastly, we show how to use this framework on any other type of model, using DaNLP's BERT as an example.
Let us start off by installing the required packages and loading the models and the dataset we wish to test on.
Installing packages#
To get started, we first need to install a few packages:
# install models
pip install dacy
python -m spacy download da_core_news_sm
# install augmentation library
pip install "augmenty>1.3.0"
Loading models and data#
from dacy.datasets import dane
# load the DaNE test set
test = dane(splits=["test"])
import dacy
import spacy
# load models
spacy_small = spacy.load("da_core_news_sm")
dacy_small = dacy.load("small")
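As a quick sanity check before scoring, we can inspect what was loaded. The snippet below is a small sketch; it assumes (as in DaCy) that dane returns a spaCy Corpus, which yields Example objects when called with a pipeline.
# sanity check: a spaCy Corpus yields Example objects when called with a pipeline
examples = list(test(spacy_small))
print(f"Documents in the DaNE test split: {len(examples)}")
print(f"Components in da_core_news_sm: {spacy_small.pipe_names}")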
Estimating performance#
Evaluating models already in the spaCy framework is very straightforward. Simply call the score function on your nlp pipeline and choose which metrics you want to calculate performance for. score is a wrapper around spacy.scorer.Scorer that outputs a nicely formatted dataframe. By default, score calculates performance for NER, POS tagging, tokenization, and dependency parsing; this can be changed with the score_fn argument.
from dacy.score import score
spacy_baseline = score(test, apply_fn=spacy_small, score_fn=["ents", "pos"])
dacy_baseline = score(test, apply_fn=dacy_small, score_fn=["ents", "pos"])
spacy_baseline
|   | wall_time | ents_p | ents_r | ents_f | ents_per_type_LOC_p | ents_per_type_LOC_r | ents_per_type_LOC_f | ents_per_type_MISC_p | ents_per_type_MISC_r | ents_per_type_MISC_f | ... | ents_per_type_PER_f | ents_per_type_ORG_p | ents_per_type_ORG_r | ents_per_type_ORG_f | ents_excl_MISC_ents_p | ents_excl_MISC_ents_r | ents_excl_MISC_ents_f | pos_acc | tag_acc | k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.862225 | 0.685598 | 0.605735 | 0.643197 | 0.571429 | 0.666667 | 0.615385 | 0.628571 | 0.545455 | 0.584071 | ... | 0.798898 | 0.677419 | 0.391304 | 0.496063 | 0.701031 | 0.622426 | 0.659394 | 0.947658 | 0.947658 | 0 |
1 rows × 22 columns
dacy_baseline
|   | wall_time | ents_p | ents_r | ents_f | ents_per_type_LOC_p | ents_per_type_LOC_r | ents_per_type_LOC_f | ents_per_type_MISC_p | ents_per_type_MISC_r | ents_per_type_MISC_f | ... | ents_per_type_PER_f | ents_per_type_ORG_p | ents_per_type_ORG_r | ents_per_type_ORG_f | ents_excl_MISC_ents_p | ents_excl_MISC_ents_r | ents_excl_MISC_ents_f | pos_acc | tag_acc | k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.808233 | 0.82852 | 0.822581 | 0.82554 | 0.767241 | 0.927083 | 0.839623 | 0.764706 | 0.752066 | 0.758333 | ... | 0.920904 | 0.8 | 0.720497 | 0.75817 | 0.845977 | 0.842105 | 0.844037 | 0.978324 | 0.978972 | 0 |
1 rows × 22 columns
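Since score returns an ordinary pandas DataFrame, the two baselines can be compared directly with standard pandas operations. A minimal sketch using a few of the column names visible in the output above:
import pandas as pd
# side-by-side comparison of a few headline metrics for the two baselines
cols = ["ents_f", "pos_acc", "wall_time"]
baselines = pd.concat([spacy_baseline, dacy_baseline], keys=["spacy_small", "dacy_small"])
print(baselines[cols])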
Estimating robustness and biases#
To obtain performance estimates on augmented data, simply provide a list of augmenters as the augmenters argument.
from augmenty.span.entities import create_per_replace_augmenter_v1
from dacy.datasets import female_names
from spacy.training.augment import create_lower_casing_augmenter
lower_aug = create_lower_casing_augmenter(level=1)
female_name_dict = female_names()
# Augmenter that replaces person names with random Danish female names,
# using one of the three name patterns defined below
patterns = [
    ["firstname"],
    ["firstname", "lastname"],
    ["firstname", "firstname", "lastname"],
]
female_aug = create_per_replace_augmenter_v1(female_name_dict, patterns, level=0.1)
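Before scoring with an augmenter, it can be useful to preview what it actually does to a text. Below is a small sketch using augmenty.texts; the example sentence is made up for illustration.
import augmenty
# preview the lower-casing augmenter on a single made-up sentence
example_texts = ["Peter Schmeichel spillede for Manchester United."]
print(list(augmenty.texts(example_texts, augmenter=lower_aug, nlp=spacy_small)))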
spacy_aug = score(
    test,
    apply_fn=spacy_small,
    score_fn=["ents", "pos"],
    augmenters=[lower_aug, female_aug],
)
dacy_aug = score(
    test,
    apply_fn=dacy_small,
    score_fn=["ents", "pos"],
    augmenters=[lower_aug, female_aug],
)
import pandas as pd
pd.concat([spacy_baseline, spacy_aug])
|   | wall_time | ents_p | ents_r | ents_f | ents_per_type_LOC_p | ents_per_type_LOC_r | ents_per_type_LOC_f | ents_per_type_MISC_p | ents_per_type_MISC_r | ents_per_type_MISC_f | ... | ents_per_type_PER_f | ents_per_type_ORG_p | ents_per_type_ORG_r | ents_per_type_ORG_f | ents_excl_MISC_ents_p | ents_excl_MISC_ents_r | ents_excl_MISC_ents_f | pos_acc | tag_acc | k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.862225 | 0.685598 | 0.605735 | 0.643197 | 0.571429 | 0.666667 | 0.615385 | 0.628571 | 0.545455 | 0.584071 | ... | 0.798898 | 0.677419 | 0.391304 | 0.496063 | 0.701031 | 0.622426 | 0.659394 | 0.947658 | 0.947658 | 0 |
| 0 | 1.873839 | 0.695652 | 0.286738 | 0.406091 | 0.687500 | 0.343750 | 0.458333 | 0.720000 | 0.446281 | 0.551020 | ... | 0.412451 | 0.666667 | 0.124224 | 0.209424 | 0.683871 | 0.242563 | 0.358108 | 0.922885 | 0.922885 | 0 |
| 0 | 1.699737 | 0.685598 | 0.605735 | 0.643197 | 0.571429 | 0.666667 | 0.615385 | 0.628571 | 0.545455 | 0.584071 | ... | 0.798898 | 0.677419 | 0.391304 | 0.496063 | 0.701031 | 0.622426 | 0.659394 | 0.947658 | 0.947658 | 0 |
3 rows × 22 columns
pd.concat([dacy_baseline, dacy_aug])
|   | wall_time | ents_p | ents_r | ents_f | ents_per_type_LOC_p | ents_per_type_LOC_r | ents_per_type_LOC_f | ents_per_type_MISC_p | ents_per_type_MISC_r | ents_per_type_MISC_f | ... | ents_per_type_PER_f | ents_per_type_ORG_p | ents_per_type_ORG_r | ents_per_type_ORG_f | ents_excl_MISC_ents_p | ents_excl_MISC_ents_r | ents_excl_MISC_ents_f | pos_acc | tag_acc | k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.808233 | 0.828520 | 0.822581 | 0.82554 | 0.767241 | 0.927083 | 0.839623 | 0.764706 | 0.752066 | 0.758333 | ... | 0.920904 | 0.800000 | 0.720497 | 0.758170 | 0.845977 | 0.842105 | 0.844037 | 0.978324 | 0.978972 | 0 |
| 0 | 5.315962 | 0.607143 | 0.213262 | 0.31565 | 0.677419 | 0.218750 | 0.330709 | 0.490566 | 0.429752 | 0.458150 | ... | 0.245283 | 0.740741 | 0.124224 | 0.212766 | 0.744444 | 0.153318 | 0.254269 | 0.933873 | 0.931722 | 0 |
| 0 | 5.710288 | 0.828520 | 0.822581 | 0.82554 | 0.767241 | 0.927083 | 0.839623 | 0.764706 | 0.752066 | 0.758333 | ... | 0.920904 | 0.800000 | 0.720497 | 0.758170 | 0.845977 | 0.842105 | 0.844037 | 0.978324 | 0.978972 | 0 |
3 rows × 22 columns
In the second row, we see that spaCy small is very vulnerable to lower casing: NER recall drops from 0.61 to 0.29. DaCy small suffers even more, with recall falling from 0.82 to 0.21. The name replacement (third row) leaves the scores unchanged in this particular run, which illustrates that a single run of a stochastic augmenter is not a reliable estimate.
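To put a number on the drop, the relative loss in NER recall can be computed directly from the dataframes. This sketch assumes, as in the output above, that the rows of spacy_aug and dacy_aug follow the order of the augmenters list (lower casing first):
# relative drop in NER recall under lower casing (the first augmenter in the list)
for name, baseline, augmented in [
    ("spaCy small", spacy_baseline, spacy_aug),
    ("DaCy small", dacy_baseline, dacy_aug),
]:
    drop = 1 - augmented.iloc[0]["ents_r"] / baseline.iloc[0]["ents_r"]
    print(f"{name}: {drop:.0%} relative drop in NER recall")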
To better estimate the effect of stochastic augmenters, such as those changing names or adding keystroke errors, we can use the k argument in score to run the augmenter multiple times.
from augmenty.character.replace import create_keystroke_error_augmenter_v1
key_05_aug = create_keystroke_error_augmenter_v1(level=0.5, keyboard="da_qwerty_v1")
spacy_key = score(
    test, apply_fn=spacy_small, score_fn=["ents", "pos"], augmenters=[key_05_aug], k=5
)
spacy_key
|   | wall_time | ents_p | ents_r | ents_f | ents_per_type_LOC_p | ents_per_type_LOC_r | ents_per_type_LOC_f | ents_per_type_MISC_p | ents_per_type_MISC_r | ents_per_type_MISC_f | ... | ents_per_type_PER_f | ents_per_type_ORG_p | ents_per_type_ORG_r | ents_per_type_ORG_f | ents_excl_MISC_ents_p | ents_excl_MISC_ents_r | ents_excl_MISC_ents_f | pos_acc | tag_acc | k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.173135 | 0.096026 | 0.103943 | 0.099828 | 0.109890 | 0.104167 | 0.106952 | 0.060811 | 0.074380 | 0.066914 | ... | 0.141732 | 0.073171 | 0.074534 | 0.073846 | 0.107456 | 0.112128 | 0.109742 | 0.326630 | 0.326630 | 0 |
| 1 | 2.117777 | 0.116949 | 0.123656 | 0.120209 | 0.145631 | 0.156250 | 0.150754 | 0.073171 | 0.074380 | 0.073770 | ... | 0.181818 | 0.066298 | 0.074534 | 0.070175 | 0.128480 | 0.137300 | 0.132743 | 0.319308 | 0.319308 | 1 |
| 2 | 2.094923 | 0.097603 | 0.102151 | 0.099825 | 0.060000 | 0.062500 | 0.061224 | 0.063830 | 0.074380 | 0.068702 | ... | 0.153439 | 0.089655 | 0.080745 | 0.084967 | 0.108352 | 0.109840 | 0.109091 | 0.321187 | 0.321187 | 2 |
| 3 | 2.070100 | 0.123539 | 0.132616 | 0.127917 | 0.134831 | 0.125000 | 0.129730 | 0.080645 | 0.082645 | 0.081633 | ... | 0.153465 | 0.129630 | 0.130435 | 0.130031 | 0.134737 | 0.146453 | 0.140351 | 0.313382 | 0.313382 | 3 |
| 4 | 2.069810 | 0.099831 | 0.105735 | 0.102698 | 0.104762 | 0.114583 | 0.109453 | 0.033613 | 0.033058 | 0.033333 | ... | 0.172973 | 0.067797 | 0.074534 | 0.071006 | 0.116525 | 0.125858 | 0.121012 | 0.315617 | 0.315617 | 4 |
5 rows × 22 columns
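Because score returns one row per run, the k runs can be summarised with standard pandas aggregation, for example a mean and standard deviation of the metrics of interest:
# aggregate the k = 5 runs into a mean and standard deviation per metric
print(spacy_key[["ents_f", "pos_acc"]].agg(["mean", "std"]))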
In this manner, evaluating performance on augmented data for spaCy pipelines is as easy as defining the augmenters and calling a single function. The dacy_paper_replication.py script contains the exact code used to evaluate the robustness of Danish NLP models in the DaCy paper.