Evaluating Robustness#
This tutorial walks through how to use Augmenty/spaCy augmenters to evaluate the robustness of any NLP pipeline. As an example, we start out by evaluating spaCy small and DaCy small on the test set of DaNE, the Danish Dependency Treebank annotated with part-of-speech tags, dependency relations, and named entities. Lastly, we show how to use this framework on any other type of model, using DaNLP's BERT as an example.
Let us start off by installing the required packages and loading the models and the dataset we wish to test on.
Installing packages#
To get started, we first need to install a few packages:
# install models
pip install dacy
python -m spacy download da_core_news_sm
# install augmentation library
pip install "augmenty>1.3.0"
Loading models and data#
from dacy.datasets import dane
# load the DaNE test set
test = dane(splits=["test"])
import dacy
import spacy
# load models
spacy_small = spacy.load("da_core_news_sm")
dacy_small = dacy.load("small")
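As a quick sanity check before scoring, we can inspect what was loaded. The snippet below is a small sketch; it assumes (as in DaCy) that dane returns a spaCy Corpus, which yields Example objects when called with a pipeline.
# sanity check: a spaCy Corpus yields Example objects when called with a pipeline
examples = list(test(spacy_small))
print(f"Documents in the DaNE test split: {len(examples)}")
print(f"Components in da_core_news_sm: {spacy_small.pipe_names}")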
Estimating performance#
Evaluating models already in the spaCy framework is very straightforward. Simply call the score function on your nlp pipeline and choose which metrics you want to calculate performance for. score is a wrapper around spacy.scorer.Scorer that outputs a nicely formatted dataframe. By default, score calculates performance for NER, POS tagging, tokenization, and dependency parsing; this can be changed with the score_fn argument.
from dacy.score import score
spacy_baseline = score(test, apply_fn=spacy_small, score_fn=["ents", "pos"])
dacy_baseline = score(test, apply_fn=dacy_small, score_fn=["ents", "pos"])
spacy_baseline
|   | wall_time | ents_p | ents_r | ents_f | ents_per_type_LOC_p | ents_per_type_LOC_r | ents_per_type_LOC_f | ents_per_type_MISC_p | ents_per_type_MISC_r | ents_per_type_MISC_f | ... | ents_per_type_PER_f | ents_per_type_ORG_p | ents_per_type_ORG_r | ents_per_type_ORG_f | ents_excl_MISC_ents_p | ents_excl_MISC_ents_r | ents_excl_MISC_ents_f | pos_acc | tag_acc | k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.862225 | 0.685598 | 0.605735 | 0.643197 | 0.571429 | 0.666667 | 0.615385 | 0.628571 | 0.545455 | 0.584071 | ... | 0.798898 | 0.677419 | 0.391304 | 0.496063 | 0.701031 | 0.622426 | 0.659394 | 0.947658 | 0.947658 | 0 |
1 rows × 22 columns
dacy_baseline
|   | wall_time | ents_p | ents_r | ents_f | ents_per_type_LOC_p | ents_per_type_LOC_r | ents_per_type_LOC_f | ents_per_type_MISC_p | ents_per_type_MISC_r | ents_per_type_MISC_f | ... | ents_per_type_PER_f | ents_per_type_ORG_p | ents_per_type_ORG_r | ents_per_type_ORG_f | ents_excl_MISC_ents_p | ents_excl_MISC_ents_r | ents_excl_MISC_ents_f | pos_acc | tag_acc | k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.808233 | 0.82852 | 0.822581 | 0.82554 | 0.767241 | 0.927083 | 0.839623 | 0.764706 | 0.752066 | 0.758333 | ... | 0.920904 | 0.8 | 0.720497 | 0.75817 | 0.845977 | 0.842105 | 0.844037 | 0.978324 | 0.978972 | 0 |
1 rows × 22 columns
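Since score returns an ordinary pandas DataFrame, the two baselines can be compared directly with standard pandas operations. A minimal sketch using a few of the column names visible in the output above:
import pandas as pd
# side-by-side comparison of a few headline metrics for the two baselines
cols = ["ents_f", "pos_acc", "wall_time"]
baselines = pd.concat([spacy_baseline, dacy_baseline], keys=["spacy_small", "dacy_small"])
print(baselines[cols])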
Estimating robustness and biases#
To obtain performance estimates on augmented data, simply provide a list of augmenters as the augmenters argument.
from augmenty.span.entities import create_per_replace_augmenter_v1
from dacy.datasets import female_names
from spacy.training.augment import create_lower_casing_augmenter
lower_aug = create_lower_casing_augmenter(level=1)
female_name_dict = female_names()
# Augmenter that replaces person names with random Danish female names,
# using one of the three name patterns defined below
patterns = [
    ["firstname"],
    ["firstname", "lastname"],
    ["firstname", "firstname", "lastname"],
]
female_aug = create_per_replace_augmenter_v1(female_name_dict, patterns, level=0.1)
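Before scoring with an augmenter, it can be useful to preview what it actually does to a text. Below is a small sketch using augmenty.texts; the example sentence is made up for illustration.
import augmenty
# preview the lower-casing augmenter on a single made-up sentence
example_texts = ["Peter Schmeichel spillede for Manchester United."]
print(list(augmenty.texts(example_texts, augmenter=lower_aug, nlp=spacy_small)))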
spacy_aug = score(
    test,
    apply_fn=spacy_small,
    score_fn=["ents", "pos"],
    augmenters=[lower_aug, female_aug],
)
dacy_aug = score(
    test,
    apply_fn=dacy_small,
    score_fn=["ents", "pos"],
    augmenters=[lower_aug, female_aug],
)
import pandas as pd
pd.concat([spacy_baseline, spacy_aug])
|   | wall_time | ents_p | ents_r | ents_f | ents_per_type_LOC_p | ents_per_type_LOC_r | ents_per_type_LOC_f | ents_per_type_MISC_p | ents_per_type_MISC_r | ents_per_type_MISC_f | ... | ents_per_type_PER_f | ents_per_type_ORG_p | ents_per_type_ORG_r | ents_per_type_ORG_f | ents_excl_MISC_ents_p | ents_excl_MISC_ents_r | ents_excl_MISC_ents_f | pos_acc | tag_acc | k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.862225 | 0.685598 | 0.605735 | 0.643197 | 0.571429 | 0.666667 | 0.615385 | 0.628571 | 0.545455 | 0.584071 | ... | 0.798898 | 0.677419 | 0.391304 | 0.496063 | 0.701031 | 0.622426 | 0.659394 | 0.947658 | 0.947658 | 0 |
| 0 | 1.873839 | 0.695652 | 0.286738 | 0.406091 | 0.687500 | 0.343750 | 0.458333 | 0.720000 | 0.446281 | 0.551020 | ... | 0.412451 | 0.666667 | 0.124224 | 0.209424 | 0.683871 | 0.242563 | 0.358108 | 0.922885 | 0.922885 | 0 |
| 0 | 1.699737 | 0.685598 | 0.605735 | 0.643197 | 0.571429 | 0.666667 | 0.615385 | 0.628571 | 0.545455 | 0.584071 | ... | 0.798898 | 0.677419 | 0.391304 | 0.496063 | 0.701031 | 0.622426 | 0.659394 | 0.947658 | 0.947658 | 0 |
3 rows × 22 columns
pd.concat([dacy_baseline, dacy_aug])
|   | wall_time | ents_p | ents_r | ents_f | ents_per_type_LOC_p | ents_per_type_LOC_r | ents_per_type_LOC_f | ents_per_type_MISC_p | ents_per_type_MISC_r | ents_per_type_MISC_f | ... | ents_per_type_PER_f | ents_per_type_ORG_p | ents_per_type_ORG_r | ents_per_type_ORG_f | ents_excl_MISC_ents_p | ents_excl_MISC_ents_r | ents_excl_MISC_ents_f | pos_acc | tag_acc | k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.808233 | 0.828520 | 0.822581 | 0.82554 | 0.767241 | 0.927083 | 0.839623 | 0.764706 | 0.752066 | 0.758333 | ... | 0.920904 | 0.800000 | 0.720497 | 0.758170 | 0.845977 | 0.842105 | 0.844037 | 0.978324 | 0.978972 | 0 |
| 0 | 5.315962 | 0.607143 | 0.213262 | 0.31565 | 0.677419 | 0.218750 | 0.330709 | 0.490566 | 0.429752 | 0.458150 | ... | 0.245283 | 0.740741 | 0.124224 | 0.212766 | 0.744444 | 0.153318 | 0.254269 | 0.933873 | 0.931722 | 0 |
| 0 | 5.710288 | 0.828520 | 0.822581 | 0.82554 | 0.767241 | 0.927083 | 0.839623 | 0.764706 | 0.752066 | 0.758333 | ... | 0.920904 | 0.800000 | 0.720497 | 0.758170 | 0.845977 | 0.842105 | 0.844037 | 0.978324 | 0.978972 | 0 |
3 rows × 22 columns
In the second row, we see that spaCy small is very vulnerable to lower casing: NER recall drops from 0.61 to 0.29. DaCy small suffers even more, with recall falling from 0.82 to 0.21. The name replacement (third row) leaves the scores unchanged in this particular run, which illustrates that a single run of a stochastic augmenter is not a reliable estimate.
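To put a number on the drop, the relative loss in NER recall can be computed directly from the dataframes. This sketch assumes, as in the output above, that the rows of spacy_aug and dacy_aug follow the order of the augmenters list (lower casing first):
# relative drop in NER recall under lower casing (the first augmenter in the list)
for name, baseline, augmented in [
    ("spaCy small", spacy_baseline, spacy_aug),
    ("DaCy small", dacy_baseline, dacy_aug),
]:
    drop = 1 - augmented.iloc[0]["ents_r"] / baseline.iloc[0]["ents_r"]
    print(f"{name}: {drop:.0%} relative drop in NER recall")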
To better estimate the effect of stochastic augmenters, such as those changing names or adding keystroke errors, we can use the k argument in score to run the augmenter multiple times.
from augmenty.character.replace import create_keystroke_error_augmenter_v1
key_05_aug = create_keystroke_error_augmenter_v1(level=0.5, keyboard="da_qwerty_v1")
spacy_key = score(
    test, apply_fn=spacy_small, score_fn=["ents", "pos"], augmenters=[key_05_aug], k=5
)
spacy_key
|   | wall_time | ents_p | ents_r | ents_f | ents_per_type_LOC_p | ents_per_type_LOC_r | ents_per_type_LOC_f | ents_per_type_MISC_p | ents_per_type_MISC_r | ents_per_type_MISC_f | ... | ents_per_type_PER_f | ents_per_type_ORG_p | ents_per_type_ORG_r | ents_per_type_ORG_f | ents_excl_MISC_ents_p | ents_excl_MISC_ents_r | ents_excl_MISC_ents_f | pos_acc | tag_acc | k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.173135 | 0.096026 | 0.103943 | 0.099828 | 0.109890 | 0.104167 | 0.106952 | 0.060811 | 0.074380 | 0.066914 | ... | 0.141732 | 0.073171 | 0.074534 | 0.073846 | 0.107456 | 0.112128 | 0.109742 | 0.326630 | 0.326630 | 0 |
| 1 | 2.117777 | 0.116949 | 0.123656 | 0.120209 | 0.145631 | 0.156250 | 0.150754 | 0.073171 | 0.074380 | 0.073770 | ... | 0.181818 | 0.066298 | 0.074534 | 0.070175 | 0.128480 | 0.137300 | 0.132743 | 0.319308 | 0.319308 | 1 |
| 2 | 2.094923 | 0.097603 | 0.102151 | 0.099825 | 0.060000 | 0.062500 | 0.061224 | 0.063830 | 0.074380 | 0.068702 | ... | 0.153439 | 0.089655 | 0.080745 | 0.084967 | 0.108352 | 0.109840 | 0.109091 | 0.321187 | 0.321187 | 2 |
| 3 | 2.070100 | 0.123539 | 0.132616 | 0.127917 | 0.134831 | 0.125000 | 0.129730 | 0.080645 | 0.082645 | 0.081633 | ... | 0.153465 | 0.129630 | 0.130435 | 0.130031 | 0.134737 | 0.146453 | 0.140351 | 0.313382 | 0.313382 | 3 |
| 4 | 2.069810 | 0.099831 | 0.105735 | 0.102698 | 0.104762 | 0.114583 | 0.109453 | 0.033613 | 0.033058 | 0.033333 | ... | 0.172973 | 0.067797 | 0.074534 | 0.071006 | 0.116525 | 0.125858 | 0.121012 | 0.315617 | 0.315617 | 4 |
5 rows × 22 columns
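Because score returns one row per run, the k runs can be summarised with standard pandas aggregation, for example a mean and standard deviation of the metrics of interest:
# aggregate the k = 5 runs into a mean and standard deviation per metric
print(spacy_key[["ents_f", "pos_acc"]].agg(["mean", "std"]))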
In this manner, evaluating performance on augmented data for spaCy pipelines is as easy as defining the augmenters and calling a single function. The dacy_paper_replication.py script contains the exact code used to evaluate the robustness of Danish NLP models in the DaCy paper.