Extracting Metrics from text using TextDescriptives
DaCy allows you to use other packages in the spaCy universe as you normally would - just powered by the DaCy models.
The following tutorial shows you how to use DaCy and TextDescriptives to extract a variety of metrics from text. For more information on the metrics that can be extracted, see the TextDescriptives documentation.
In this tutorial we’ll use TextDescriptives and DaCy to get a quick overview of the SMS Spam Collection Data Set. The dataset contains 5572 SMS messages categorized as ham or spam.
The astute among you will have noticed that this dataset is not Danish. This tutorial simply wants to show how to use DaCy and TextDescriptives together and hopefully inspire you to use the tools on your own (Danish) data.
To start, let’s load a dataset and get a bit familiar with it.
from textdescriptives.utils import load_sms_data

df = load_sms_data()
df.head()
| ||label||message|
|0||ham||Go until jurong point, crazy.. Available only ...|
|1||ham||Ok lar... Joking wif u oni...|
|2||spam||Free entry in 2 a wkly comp to win FA Cup fina...|
|3||ham||U dun say so early hor... U c already then say...|
|4||ham||Nah I don't think he goes to usf, he lives aro...|
df["label"].value_counts()
label
ham     4825
spam     747
Name: count, dtype: int64
Adding TextDescriptives components to DaCy
Adding TextDescriptives components to a DaCy pipeline follows exactly the same procedure as for any spaCy model. Let’s add the readability and dependency_distance components. The readability component calculates readability metrics, and the dependency_distance component calculates the average dependency distance between words in a sentence, which can be seen as a measure of sentence complexity.
Because we are using a DaCy model, the dependency_distance component will use the dependency parser from DaCy for its calculations.
import dacy

nlp = dacy.load("small")  # load the latest version of the small model
nlp.add_pipe("textdescriptives/readability")
nlp.add_pipe("textdescriptives/dependency_distance")
/home/runner/.local/lib/python3.10/site-packages/spacy/util.py:910: UserWarning: [W095] Model 'da_dacy_small_trf' (0.2.0) was trained with spaCy v3.5.2 and may not be 100% compatible with the current version (3.6.1). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate warnings.warn(warn_msg) /home/runner/.local/lib/python3.10/site-packages/spacy/util.py:910: UserWarning: [W095] Model 'da_dacy_small_ner_fine_grained' (0.1.0) was trained with spaCy v3.5.0 and may not be 100% compatible with the current version (3.6.1). If you see errors or degraded performance, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate warnings.warn(warn_msg)
<textdescriptives.components.dependency_distance.DependencyDistance at 0x7f1d96e5fc10>
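To make the idea of dependency distance concrete, here is a minimal sketch in plain Python. It assumes a hand-written list of head indices for a hypothetical English sentence (in the tutorial these come from DaCy's dependency parser) and treats the root as having distance 0. The name `mean_dependency_distance` is illustrative and not part of TextDescriptives, whose implementation may differ in details.

```python
from statistics import mean

# Hypothetical sentence; heads[i] is the index of the token that token i depends on.
# The root ("sat") points to itself, so its distance is 0.
tokens = ["The", "cat", "sat", "on", "the", "mat"]
heads = [1, 2, 2, 5, 5, 2]


def mean_dependency_distance(heads: list[int]) -> float:
    """Average absolute distance (in tokens) between each token and its head."""
    return mean(abs(i - head) for i, head in enumerate(heads))


print(round(mean_dependency_distance(heads), 3))  # 1.333
```

Longer sentences with nested clauses tend to place heads further from their dependents, which is why this average is used as a rough proxy for complexity.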
From now on, whenever we pass a document through the pipeline (nlp), TextDescriptives will add readability and dependency distance metrics to the document.
Let’s load the data and pass it through the pipeline.
# to speed things up (especially on cpu) let's subsample the data
df = df.sample(500)
doc = nlp.pipe(df["message"])
import textdescriptives as td

# extract the metrics as a dataframe
metrics = td.extract_df(doc, include_text=False)
Token indices sequence length is longer than the specified maximum sequence length for this model (161 > 128). Running this sequence through the model will result in indexing errors
# join the metrics to the original dataframe
df = df.join(metrics, how="left")
df.head()
|1174||ham||Yay! You better not have told that to 5 other ...||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||...||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN|
|1818||ham||Am i that much dirty fellow?||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||...||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN|
|4902||ham||\I;m reaching in another 2 stops.\""||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||...||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN|
|2762||ham||ARR birthday today:) i wish him to get more os...||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||...||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN||NaN|
|499||ham||Dear i have reache room||5.0||4.0||2.366432||5.0||5.0||0.0||1.8||1.0||...||NaN||18.0||4.62||7.68||25.0||1.0||2.0||0.0||0.333333||0.0|
5 rows × 28 columns
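Most rows in the output above are NaN. A likely cause, assuming `td.extract_df` returns a dataframe with a fresh 0..n-1 index, is that `df.sample(500)` keeps the original row labels, so `join` only finds a match where the two indices happen to coincide. The sketch below reproduces the effect with made-up toy frames (not the SMS data) and shows one possible way around it.

```python
import pandas as pd

# Toy stand-in for the subsampled data: the index keeps the original row labels.
df = pd.DataFrame(
    {"label": ["ham", "spam", "ham"], "message": ["a b c d e", "f g h i j k l", "m n o"]},
    index=[1174, 1818, 2],
)
# Toy stand-in for the extracted metrics: a fresh 0..n-1 index.
metrics = pd.DataFrame({"n_tokens": [5, 7, 3]})

# Joining on mismatched indices leaves NaN wherever the labels do not line up.
misaligned = df.join(metrics, how="left")

# Resetting the index first makes row i of df line up with row i of metrics.
aligned = df.reset_index(drop=True).join(metrics, how="left")

print(misaligned["n_tokens"].isna().sum())  # 2
print(aligned["n_tokens"].tolist())
```

In the tutorial's code this would correspond to calling `reset_index(drop=True)` right after sampling. The same applies to any later index-aligned operation, such as `corrwith`.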
That’s it! Let’s do a bit of exploratory data analysis to get to know the data a bit more.
Exploratory Data Analysis
With the metrics extracted, let’s do some quick exploratory data analysis to get a sense of the data. Let us start off by taking a look at the distribution of one of the readability metrics, lix:
import seaborn as sns

sns.boxplot(x="label", y="lix", data=df)
<Axes: xlabel='label', ylabel='lix'>
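For readers unfamiliar with the lix score plotted above: it combines average sentence length with the share of long words (more than six letters). The following is a rough sketch of the standard formula using naive regex-based word and sentence splitting. That splitting is an assumption made for illustration. TextDescriptives works on spaCy tokens and sentences, so its values can differ.

```python
import re


def lix(text: str) -> float:
    """LIX = words / sentences + 100 * long_words / words (long = more than 6 letters)."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, including æ, ø, å
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0  # nothing to score
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)


# 10 words, 2 sentences, 1 long word ("competition")
print(lix("Free entry in a weekly competition. Text now to win."))  # 15.0
```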
Let’s run a quick test to see if any of our metrics correlate strongly with the label:
# encode the label as a boolean
df["is_ham"] = df["label"] == "ham"

# compute the correlation between all metrics and the label
metrics_correlations = metrics.corrwith(df["is_ham"]).sort_values(key=abs, ascending=False)
metrics_correlations[:10]
/home/runner/.local/lib/python3.10/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide c /= stddev[:, None] /home/runner/.local/lib/python3.10/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide c /= stddev[None, :]
token_length_median          -0.188698
gunning_fog                  -0.163164
syllables_per_token_median    0.115249
n_unique_tokens               0.110283
n_characters                  0.105183
lix                          -0.104429
token_length_std             -0.101153
token_length_mean            -0.097317
n_tokens                      0.093519
coleman_liau_index           -0.082565
dtype: float64
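By default, `corrwith` computes a Pearson correlation for each column. With a boolean label this is the point-biserial correlation. The sketch below spells out the calculation in plain Python on made-up values. The function name `pearson` and the numbers are illustrative only. A column with zero variance has no defined correlation, which is a likely source of the RuntimeWarning shown above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


# Made-up values: booleans behave as 1 (ham) and 0 (spam).
is_ham = [True, True, False, False]
token_length_median = [3.0, 4.0, 5.0, 6.0]

r = pearson(is_ham, token_length_median)
print(round(r, 3))  # -0.894
```

A negative value here means the metric tends to be lower for ham than for spam.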
Several metrics show a noticeable correlation with the label, although the exact values will vary between runs because the subsample is random. Notably, the mean dependency distance is correlated with ham. This makes sense, as the dependency distance is a measure of sentence complexity, and spam messages tend to be shorter and simpler.
Let’s try to plot it:
sns.kdeplot(df, x="dependency_distance_mean", hue="label", fill=True)
<Axes: xlabel='dependency_distance_mean', ylabel='Density'>
We can do a similar thing for the lix score, where we see that there isn’t a big difference between the two classes:
sns.kdeplot(df, x="lix", hue="label", fill=True)
<Axes: xlabel='lix', ylabel='Density'>
Cool! We’ve now done a quick analysis of the SMS dataset and found some differences in the distributions of some readability and dependency-distance metrics between ham and spam messages.
Next steps could be to continue the exploratory data analysis or to build a simple classifier using the extracted metrics.
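As one very small illustration of the classifier idea, the sketch below fits a single-threshold rule on one metric in plain Python. The values are made up and the helper `fit_threshold` is hypothetical. It only tries the rule "predict ham when the value is at or below the threshold" and scores it on the same data it was fitted on. A real analysis would use held-out data and most likely a dedicated machine learning library.

```python
def fit_threshold(values, labels):
    """Return (threshold, accuracy) for the rule: predict True when value <= threshold."""
    best_threshold, best_accuracy = None, -1.0
    for threshold in sorted(set(values)):
        predictions = [v <= threshold for v in values]
        accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy


# Made-up metric values and labels, for illustration only.
token_length_median = [3.0, 4.0, 3.0, 5.0, 6.0, 5.0]
is_ham = [True, True, True, False, True, False]

threshold, accuracy = fit_threshold(token_length_median, is_ham)
print(threshold, round(accuracy, 3))  # 4.0 0.833
```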