Evaluating Robustness#

This tutorial walks through how to use Augmenty/SpaCy augmenters to evaluate the robustness of any NLP pipeline. As an example, we'll start out by evaluating SpaCy small and DaCy small on the test set of DaNE. DaNE is the Danish Dependency Treebank tagged for part-of-speech tags, dependency relations, and named entities. Lastly, we will show how to use this framework on any other type of model, using DaNLP's BERT as an example.

Let us start off by installing the required packages and loading the models and the dataset we wish to test on.

Installing packages#

To get started we will first need to install a few packages:

# install models
pip install dacy
python -m spacy download da_core_news_sm

# install augmentation library
pip install "augmenty>1.3.0"

Loading models and data#

from dacy.datasets import dane

# load the DaNE test set
test = dane(splits=["test"])
import dacy
import spacy

# load models
spacy_small = spacy.load("da_core_news_sm")
dacy_small = dacy.load("small")

Estimating performance#

Evaluating models already in the SpaCy framework is very straightforward: simply call the score function on your nlp pipeline and choose which metrics to calculate performance for. score is a wrapper around spacy.scorer.Scorer that outputs a nicely formatted dataframe. By default, score calculates performance for NER, POS tagging, tokenization, and dependency parsing; this can be changed with the score_fn argument.

from dacy.score import score

spacy_baseline = score(test, apply_fn=spacy_small, score_fn=["ents", "pos"])
dacy_baseline = score(test, apply_fn=dacy_small, score_fn=["ents", "pos"])
spacy_baseline
wall_time ents_p ents_r ents_f ents_per_type_LOC_p ents_per_type_LOC_r ents_per_type_LOC_f ents_per_type_MISC_p ents_per_type_MISC_r ents_per_type_MISC_f ... ents_per_type_PER_f ents_per_type_ORG_p ents_per_type_ORG_r ents_per_type_ORG_f ents_excl_MISC_ents_p ents_excl_MISC_ents_r ents_excl_MISC_ents_f pos_acc tag_acc k
0 1.862225 0.685598 0.605735 0.643197 0.571429 0.666667 0.615385 0.628571 0.545455 0.584071 ... 0.798898 0.677419 0.391304 0.496063 0.701031 0.622426 0.659394 0.947658 0.947658 0

1 rows × 22 columns

dacy_baseline
wall_time ents_p ents_r ents_f ents_per_type_LOC_p ents_per_type_LOC_r ents_per_type_LOC_f ents_per_type_MISC_p ents_per_type_MISC_r ents_per_type_MISC_f ... ents_per_type_PER_f ents_per_type_ORG_p ents_per_type_ORG_r ents_per_type_ORG_f ents_excl_MISC_ents_p ents_excl_MISC_ents_r ents_excl_MISC_ents_f pos_acc tag_acc k
0 5.808233 0.82852 0.822581 0.82554 0.767241 0.927083 0.839623 0.764706 0.752066 0.758333 ... 0.920904 0.8 0.720497 0.75817 0.845977 0.842105 0.844037 0.978324 0.978972 0

1 rows × 22 columns
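To make the baseline comparison concrete, we can pull the headline metrics out of the two result tables. The snippet below hardcodes the values printed above rather than re-running the models, so it is only a sketch of the comparison:

```python
# Headline scores copied from the two result tables above
baseline = {
    "spacy_small": {"ents_f": 0.643197, "pos_acc": 0.947658},
    "dacy_small": {"ents_f": 0.825540, "pos_acc": 0.978324},
}

# DaCy small outperforms SpaCy small on both metrics
for metric in ("ents_f", "pos_acc"):
    diff = baseline["dacy_small"][metric] - baseline["spacy_small"][metric]
    print(f"{metric}: DaCy small is {diff:.3f} higher")
```

The gap is largest on NER F1 (about 0.18), while the POS accuracy difference is about 0.03.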

Estimating robustness and biases#

To obtain performance estimates on augmented data, simply provide a list of augmenters as the augmenters argument.

from augmenty.span.entities import create_per_replace_augmenter_v1
from dacy.datasets import female_names
from spacy.training.augment import create_lower_casing_augmenter
lower_aug = create_lower_casing_augmenter(level=1)
female_name_dict = female_names()
# Augmenter that replaces person entities with random Danish female names,
# where each replacement follows one of the patterns defined below

patterns = [
    ["firstname"],
    ["firstname", "lastname"],
    ["firstname", "firstname", "lastname"],
]
female_aug = create_per_replace_augmenter_v1(female_name_dict, patterns, level=0.1)
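To build intuition for what the name augmenter does, here is a minimal, hypothetical re-implementation of the pattern expansion in plain Python. The name lists and the sample_name helper are illustrative stand-ins, not part of the Augmenty API:

```python
import random

# Illustrative stand-ins for the name lists in female_name_dict (not the real data)
first_names = ["Anna", "Freja", "Ida"]
last_names = ["Jensen", "Nielsen", "Hansen"]

patterns = [
    ["firstname"],
    ["firstname", "lastname"],
    ["firstname", "firstname", "lastname"],
]

def sample_name(pattern: list[str]) -> str:
    """Expand one pattern into a full name, as the augmenter does per entity."""
    parts = [
        random.choice(first_names if part == "firstname" else last_names)
        for part in pattern
    ]
    return " ".join(parts)

# Each replaced entity gets one of the patterns, filled with sampled names
print(sample_name(random.choice(patterns)))
```

With level=0.1, roughly 10% of person entities are replaced this way; the rest are left untouched.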

spacy_aug = score(
    test,
    apply_fn=spacy_small,
    score_fn=["ents", "pos"],
    augmenters=[lower_aug, female_aug],
)
dacy_aug = score(
    test,
    apply_fn=dacy_small,
    score_fn=["ents", "pos"],
    augmenters=[lower_aug, female_aug],
)
import pandas as pd

pd.concat([spacy_baseline, spacy_aug])
wall_time ents_p ents_r ents_f ents_per_type_LOC_p ents_per_type_LOC_r ents_per_type_LOC_f ents_per_type_MISC_p ents_per_type_MISC_r ents_per_type_MISC_f ... ents_per_type_PER_f ents_per_type_ORG_p ents_per_type_ORG_r ents_per_type_ORG_f ents_excl_MISC_ents_p ents_excl_MISC_ents_r ents_excl_MISC_ents_f pos_acc tag_acc k
0 1.862225 0.685598 0.605735 0.643197 0.571429 0.666667 0.615385 0.628571 0.545455 0.584071 ... 0.798898 0.677419 0.391304 0.496063 0.701031 0.622426 0.659394 0.947658 0.947658 0
0 1.873839 0.695652 0.286738 0.406091 0.687500 0.343750 0.458333 0.720000 0.446281 0.551020 ... 0.412451 0.666667 0.124224 0.209424 0.683871 0.242563 0.358108 0.922885 0.922885 0
0 1.699737 0.685598 0.605735 0.643197 0.571429 0.666667 0.615385 0.628571 0.545455 0.584071 ... 0.798898 0.677419 0.391304 0.496063 0.701031 0.622426 0.659394 0.947658 0.947658 0

3 rows × 22 columns

pd.concat([dacy_baseline, dacy_aug])
wall_time ents_p ents_r ents_f ents_per_type_LOC_p ents_per_type_LOC_r ents_per_type_LOC_f ents_per_type_MISC_p ents_per_type_MISC_r ents_per_type_MISC_f ... ents_per_type_PER_f ents_per_type_ORG_p ents_per_type_ORG_r ents_per_type_ORG_f ents_excl_MISC_ents_p ents_excl_MISC_ents_r ents_excl_MISC_ents_f pos_acc tag_acc k
0 5.808233 0.828520 0.822581 0.82554 0.767241 0.927083 0.839623 0.764706 0.752066 0.758333 ... 0.920904 0.800000 0.720497 0.758170 0.845977 0.842105 0.844037 0.978324 0.978972 0
0 5.315962 0.607143 0.213262 0.31565 0.677419 0.218750 0.330709 0.490566 0.429752 0.458150 ... 0.245283 0.740741 0.124224 0.212766 0.744444 0.153318 0.254269 0.933873 0.931722 0
0 5.710288 0.828520 0.822581 0.82554 0.767241 0.927083 0.839623 0.764706 0.752066 0.758333 ... 0.920904 0.800000 0.720497 0.758170 0.845977 0.842105 0.844037 0.978324 0.978972 0

3 rows × 22 columns

In the second row, we see that both models are vulnerable to lower casing: SpaCy small's NER F1 drops from 0.64 to 0.41, and DaCy small's from 0.83 to 0.32. Replacing names (third row) barely affects performance in this run, likely because level=0.1 replaces only a fraction of person entities.

To better estimate the effect of stochastic augmenters, such as those replacing names or introducing keystroke errors, we can use the k argument of score to run the augmenter multiple times.

from augmenty.character.replace import create_keystroke_error_augmenter_v1

key_05_aug = create_keystroke_error_augmenter_v1(level=0.5, keyboard="da_qwerty_v1")
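For intuition, a keystroke-error augmenter swaps characters for neighbouring keys on the keyboard. The sketch below uses a tiny, hardcoded neighbour map and plain strings; the real augmenter uses the full Danish QWERTY layout (da_qwerty_v1) and operates on spaCy Examples:

```python
import random

# Tiny illustrative subset of keyboard neighbours (not the real da_qwerty_v1 layout)
NEIGHBOURS = {"a": "qsz", "e": "wrd", "n": "bm", "t": "ry"}

def keystroke_errors(text: str, level: float, seed: int = 0) -> str:
    """Replace each character with a neighbouring key with probability `level`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in NEIGHBOURS and rng.random() < level:
            out.append(rng.choice(NEIGHBOURS[ch]))
        else:
            out.append(ch)
    return "".join(out)

print(keystroke_errors("en mand går på gaden", level=0.5))
```

At level=0.5, half of all (covered) characters are perturbed on average, which is a very aggressive corruption — as the scores below show.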

spacy_key = score(
    test, apply_fn=spacy_small, score_fn=["ents", "pos"], augmenters=[key_05_aug], k=5
)
spacy_key
wall_time ents_p ents_r ents_f ents_per_type_LOC_p ents_per_type_LOC_r ents_per_type_LOC_f ents_per_type_MISC_p ents_per_type_MISC_r ents_per_type_MISC_f ... ents_per_type_PER_f ents_per_type_ORG_p ents_per_type_ORG_r ents_per_type_ORG_f ents_excl_MISC_ents_p ents_excl_MISC_ents_r ents_excl_MISC_ents_f pos_acc tag_acc k
0 2.173135 0.096026 0.103943 0.099828 0.109890 0.104167 0.106952 0.060811 0.074380 0.066914 ... 0.141732 0.073171 0.074534 0.073846 0.107456 0.112128 0.109742 0.326630 0.326630 0
1 2.117777 0.116949 0.123656 0.120209 0.145631 0.156250 0.150754 0.073171 0.074380 0.073770 ... 0.181818 0.066298 0.074534 0.070175 0.128480 0.137300 0.132743 0.319308 0.319308 1
2 2.094923 0.097603 0.102151 0.099825 0.060000 0.062500 0.061224 0.063830 0.074380 0.068702 ... 0.153439 0.089655 0.080745 0.084967 0.108352 0.109840 0.109091 0.321187 0.321187 2
3 2.070100 0.123539 0.132616 0.127917 0.134831 0.125000 0.129730 0.080645 0.082645 0.081633 ... 0.153465 0.129630 0.130435 0.130031 0.134737 0.146453 0.140351 0.313382 0.313382 3
4 2.069810 0.099831 0.105735 0.102698 0.104762 0.114583 0.109453 0.033613 0.033058 0.033333 ... 0.172973 0.067797 0.074534 0.071006 0.116525 0.125858 0.121012 0.315617 0.315617 4

5 rows × 22 columns
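Since each of the k runs uses a different random augmentation, it is natural to summarise them with a mean and standard deviation. Using the ents_f values from the table above (hardcoded here for illustration):

```python
from statistics import mean, stdev

# ents_f for the five keystroke-augmented runs of SpaCy small (copied from the table above)
ents_f = [0.099828, 0.120209, 0.099825, 0.127917, 0.102698]

print(f"NER F1: {mean(ents_f):.3f} ± {stdev(ents_f):.3f}")
```

Reporting the spread alongside the mean makes clear how much the estimate varies across augmentation runs.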

In this manner, evaluating performance on augmented data for SpaCy pipelines is as easy as defining the augmenters and calling a single function. The dacy_paper_replication.py script contains the exact code used to evaluate the robustness of Danish NLP models in the DaCy paper.