Accuracy¶
The most common question that people have about scrubadub is:
How accurately can scrubadub detect PII?
It’s a great question that is hard, but essential, to answer.
It is straightforward to measure this on pseudo-data (artificially generated fake data), but it’s not clear how well such results transfer to real-world applications. Open real-world datasets might help here, but given the sensitivity of PII it’s not clear that such datasets exist.
Precision and recall¶
We show the precision and recall for each of the Filth types detected by the various Detectors. Wikipedia has a good explanation, but briefly they are defined as:
Precision: The percentage of the Filth selected by the Detector that is true Filth
If this is low, lots of clean text is incorrectly detected as Filth
Recall: The percentage of the true Filth that is selected by the Detector
If this is low, lots of dirty text is not detected as Filth
f1-score: Combines the information in the precision and recall into a single number (only shown in the classification reports below)
If either precision or recall is low, the f1-score will also be low; this makes it a good summary metric of the two
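To make these definitions concrete, here is a short sketch (plain Python, not part of scrubadub) that computes all three metrics from raw true-positive, false-positive and false-negative counts; the final call mirrors the 80% precision / 100% recall example discussed later on this page:

>>> def precision_recall_f1(true_positives, false_positives, false_negatives):
...     # Precision: of everything the detector flagged, how much was true Filth?
...     precision = true_positives / (true_positives + false_positives)
...     # Recall: of all the true Filth, how much did the detector flag?
...     recall = true_positives / (true_positives + false_negatives)
...     # f1-score: the harmonic mean of precision and recall
...     f1 = 2 * precision * recall / (precision + recall)
...     return precision, recall, f1
>>> precision, recall, f1 = precision_recall_f1(true_positives=4, false_positives=1, false_negatives=0)
>>> print(round(precision, 2), round(recall, 2), round(f1, 2))
0.8 1.0 0.89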
Pseudo-data performance¶
This section uses data created by the Faker package to test the effectiveness of the various detectors. Here the detectors all generally perform very well (often 100%), but these numbers are unlikely to be representative of performance on real data.
Filth type | Detector | Locale | Precision | Recall
---|---|---|---|---
Address | Address | en_GB | 100% | 96%
Address | Address | en_US | 100% | 74%
Email | Email | N/A | 100% | 100%
Name | Name | en_US | 9% | 100%
Name | Spacy en_core_web_sm | en_US | 57% | 90%
Name | Spacy en_core_web_md | en_US | 60% | 95%
Name | Spacy en_core_web_lg | en_US | 53% | 97%
Name | Spacy en_core_web_trf | en_US | 88% | 95%
Name | Stanford NER | en_US | 96% | 90%
Phone Number | Phone Number | en_GB | 100% | 100%
Phone Number | Phone Number | en_US | 100% | 100%
Postal code | Postal code | en_GB | 100% | 74%
SSN | SSN | en_US | 100% | 100%
Twitter | Twitter | N/A | 100% | 100%
URL | URL | N/A | 100% | 100%
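Figures like these can be generated along the following lines. This is only a sketch (the paragraphs and seed values are arbitrary choices): it runs the default detectors over a generated document and compares them against the tagged PII that make_fake_document() returns, printing a classification report like those shown in the Measuring performance section below.

>>> import scrubadub, scrubadub.comparison
>>> document, tagged_pii = scrubadub.comparison.make_fake_document(paragraphs=10, seed=0)
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_detector(scrubadub.detectors.TaggedEvaluationFilthDetector(known_filth_items=tagged_pii))
>>> filth_list = list(scrubber.iter_filth(document))
>>> report = scrubadub.comparison.get_filth_classification_report(filth_list)  # precision/recall per detector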
Real data performance¶
We are trying to find datasets that could be used to evaluate performance; if you know of any, let us know. Stay tuned for more updates.
Measuring performance¶
Read this section if you want to measure performance on your own data.
First, data containing PII must be obtained, and then the PII within it must be tagged, usually by a human.
The tagged PII should be in a format that is compatible with the scrubadub.detectors.TaggedEvaluationFilthDetector.
This format is identical to that of the scrubadub.detectors.UserSuppliedFilthDetector, which is discussed on the usage page: essentially a list of dictionaries, each containing the text to be found and the type of filth it represents. An example is given below:
[ {"match": "wwashington@reed-ryan.org", "filth_type": "email"}, {"match": "https://www.wong.com/", "filth_type": "url"} ]
In our tests we have found it useful to collect tagged PII in a CSV format that mirrors the structure of the above data; an example of this is shown in tests/example_real_data/known_pii.csv. Together, the tagged PII and the original text documents can be loaded by a script that calculates the detector efficiencies; an example of such a script is given in tests/benchmark_accuracy_real_data.py. A bare-bones version of the script would follow these steps:
1. Load the documents (a list of strings) and the tagged PII (a list of dictionaries); a sketch of loading the tagged PII from CSV follows this list
2. Initialise a Scrubber, including the detectors to be evaluated and a TaggedEvaluationFilthDetector initialised with the tagged PII
3. Get the list of Filth found by the Scrubber
4. Pass the list of filth to the classification report function and print the result
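For step 1, the tagged PII could be loaded from a CSV file such as tests/example_real_data/known_pii.csv along the lines of the sketch below. Note that the column names match and filth_type are an assumption chosen to mirror the dictionary format above; check them against your own file.

>>> import csv
>>> with open("known_pii.csv", newline="") as handle:  # assumed filename and columns
...     tagged_pii = [
...         {"match": row["match"], "filth_type": row["filth_type"]}
...         for row in csv.DictReader(handle)
...     ]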
These steps are shown in the example below, except that loading real data is replaced with the make_fake_document() function, which generates a fake document together with its fake tagged PII.
If you cannot get real data, generated fake data is a workable substitute, though it is never as realistic as the real thing.
>>> import scrubadub, scrubadub.comparison, json
>>> document, tagged_pii = scrubadub.comparison.make_fake_document(paragraphs=1, seed=1)
>>> print(document[:50], '...')
Suggest shake effort many last prepare small. Main ...
>>> print(json.dumps(tagged_pii[1:], indent=4))
[
    {
        "match": "wwashington@reed-ryan.org",
        "filth_type": "email"
    },
    {
        "match": "https://www.wong.com/",
        "filth_type": "url"
    }
]
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_detector(scrubadub.detectors.TaggedEvaluationFilthDetector(known_filth_items=tagged_pii))
>>> filth_list = list(scrubber.iter_filth(document))
>>> print(scrubadub.comparison.get_filth_classification_report(filth_list))
filth detector locale      precision    recall  f1-score   support

email    email  en_US           1.00      1.00      1.00         2
  url      url  en_US           1.00      1.00      1.00         1

            micro avg           1.00      1.00      1.00         3
            macro avg           1.00      1.00      1.00         3
         weighted avg           1.00      1.00      1.00         3
          samples avg           1.00      1.00      1.00         3
In addition to this classification report, there is also the get_filth_dataframe(filth_list) function, which returns a pandas DataFrame that can be used to get more information on the types of Filth that were detected.
The classification report¶
The script above uses the get_filth_classification_report(filth_list) function to get a report containing the precision and recall of the detectors.
Those familiar with sklearn will notice that it is a slightly modified version of the sklearn classification report.
In the first column we show the Filth.type (e.g. email, name, address, …), followed by the Detector.name and Detector.locale.
After that come the fields from the classification report: the precision, recall, f1-score and support (the number of items of that type found in the document).
We have previously discussed the TaggedEvaluationFilthDetector and the UserSuppliedFilthDetector, which both search for given text in a document and return Filth.
The difference between them is the type of the Filth that they return:
The TaggedEvaluationFilthDetector always returns TaggedEvaluationFilth.
The UserSuppliedFilthDetector returns the type of filth specified (e.g. EmailFilth or PhoneFilth), as sketched below.
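As a small illustration of this difference, the sketch below (the match and document text are arbitrary placeholders) runs a UserSuppliedFilthDetector on its own; it yields an ordinary EmailFilth, whereas a TaggedEvaluationFilthDetector given the same item would yield a TaggedEvaluationFilth:

>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(detector_list=[
...     scrubadub.detectors.UserSuppliedFilthDetector([
...         {'match': 'tom@example.com', 'filth_type': 'email'},
...     ]),
... ])
>>> filth_list = list(scrubber.iter_filth("Contact tom@example.com"))  # yields an EmailFilth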
It is the TaggedEvaluationFilth that is used as the ground truth when calculating the classification report, while the other types of Filth show where a Detector identified filth.
If there is both a TaggedEvaluationFilth and another type of Filth at the same location, then a detector identified a tagged piece of filth, which means that it is a true positive.
If there is only a TaggedEvaluationFilth at a location in a document, then it is a false negative.
If there is only another type of Filth at a location in a document, then it is a false positive.
Using these counts you can build the classification report.
In the example below there are 4 locations with both a Filth and a TaggedEvaluationFilth; these are true positives.
There is also one NameFilth on its own ('nature activity'); this is a false positive.
This leads to a precision of 80% (= 4/5) and a recall of 100% (= 4/4) in the classification report.
>>> import scrubadub, scrubadub.comparison, scrubadub_spacy
>>> document, known_filth_items = scrubadub.comparison.make_fake_document(paragraphs=1, seed=3, filth_types=['name'])
>>> scrubber = scrubadub.Scrubber(detector_list=[scrubadub_spacy.detectors.SpacyNameDetector()])
>>> scrubber.add_detector(scrubadub.detectors.TaggedEvaluationFilthDetector(known_filth_items=known_filth_items))
>>> filth_list = list(scrubber.iter_filth(document))
>>> for filth in filth_list: print(filth)
<MergedFilth filths=[<NameFilth text='Jessica Sims' beg=112 end=124 detector_name='spacy_name' locale='en_US'>, <TaggedEvaluationFilth text='Jessica Sims' beg=112 end=124 comparison_type='name' detector_name='tagged' locale='en_US'>]>
<MergedFilth filths=[<NameFilth text='Michelle Mayer' beg=295 end=309 detector_name='spacy_name' locale='en_US'>, <TaggedEvaluationFilth text='Michelle Mayer' beg=295 end=309 comparison_type='name' detector_name='tagged' locale='en_US'>]>
<NameFilth text='nature activity' beg=363 end=378 detector_name='spacy_name' locale='en_US'>
<MergedFilth filths=[<NameFilth text='Claudia Carroll' beg=495 end=510 detector_name='spacy_name' locale='en_US'>, <TaggedEvaluationFilth text='Claudia Carroll' beg=495 end=510 comparison_type='name' detector_name='tagged' locale='en_US'>]>
<MergedFilth filths=[<NameFilth text='Laura Smith' beg=675 end=686 detector_name='spacy_name' locale='en_US'>, <TaggedEvaluationFilth text='Laura Smith' beg=675 end=686 comparison_type='name' detector_name='tagged' locale='en_US'>]>
>>> print(scrubadub.comparison.get_filth_classification_report(filth_list))
filth   detector locale      precision    recall  f1-score   support

 name spacy_name  en_US           0.80      1.00      0.89         4

           micro avg              0.80      1.00      0.89         4
           macro avg              0.80      1.00      0.89         4
        weighted avg              0.80      1.00      0.89         4
API reference¶
Below is the API reference for some of the functions mentioned on this page.
- scrubadub.comparison.get_filth_classification_report(filth_list: List[scrubadub.filth.base.Filth], combine_detectors: bool = False, groupby_documents: bool = False, output_dict: bool = False) → Optional[Union[str, Dict[str, float]]] [source]
Evaluates the performance of detectors using KnownFilth.
An example of using this is shown below:
>>> import scrubadub, scrubadub.comparison, scrubadub.detectors.text_blob
>>> scrubber = scrubadub.Scrubber(detector_list=[
...     scrubadub.detectors.text_blob.TextBlobNameDetector(name='name_detector'),
...     scrubadub.detectors.TaggedEvaluationFilthDetector([
...         {'match': 'Tom', 'filth_type': 'name'},
...         {'match': 'tom@example.com', 'filth_type': 'email'},
...     ]),
... ])
>>> filth_list = list(scrubber.iter_filth("Hello I am Tom"))
>>> print(scrubadub.comparison.get_filth_classification_report(filth_list))
filth      detector locale      precision    recall  f1-score   support

 name name_detector  en_US           1.00      1.00      1.00         1

             accuracy                                    1.00         1
            macro avg                1.00      1.00      1.00         1
         weighted avg                1.00      1.00      1.00         1
- Parameters
filth_list (A list of Filth objects) – The list of detected filth
combine_detectors (bool, optional) – Combine performance of all detectors for the same filth/locale
groupby_documents (bool, optional) – Show performance for each file individually
output_dict (bool, optional) – Return the report as a dict rather than a formatted string, defaults to False (see the sketch below this entry)
- Returns
The report in JSON (a dict) or in plain text
- Return type
str or dict
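When output_dict=True is passed, the report comes back as a dict rather than a formatted string, which is handier for programmatic checks. A minimal sketch, reusing the filth_list from the example above:

>>> report = scrubadub.comparison.get_filth_classification_report(filth_list, output_dict=True)
>>> isinstance(report, dict)  # the same report, as a dict instead of a string
True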
- scrubadub.comparison.get_filth_dataframe(filth_list: List[scrubadub.filth.base.Filth]) → pandas.core.frame.DataFrame [source]
Produces a pandas DataFrame to allow debugging and improving detectors.
An example of using this is shown below:
>>> import scrubadub, scrubadub.comparison, scrubadub.detectors.text_blob
>>> import pandas as pd
>>> scrubber = scrubadub.Scrubber(detector_list=[
...     scrubadub.detectors.text_blob.TextBlobNameDetector(name='name_detector'),
...     scrubadub.detectors.TaggedEvaluationFilthDetector([
...         {'match': 'Tom', 'filth_type': 'name'},
...         {'match': 'tom@example.com', 'filth_type': 'email'},
...     ]),
... ])
>>> filth_list = list(scrubber.iter_filth("Hello I am Tom"))
>>> with pd.option_context("display.max_columns", 20):
...     print(scrubadub.comparison.get_filth_dataframe(filth_list))
   group_id  filth_id filth_type  detector_name document_name text  beg  end  \
0         0         0       name  name_detector          None  Tom   11   14
  locale  known_filth comparison_type known_text  known_beg  known_end  \
0  en_US         True             NaN        Tom         11         14
  known_comparison_type  exact_match  partial_match  true_positive  \
0                  name         True           True           True
   false_positive  false_negative
0           False           False
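Because the returned DataFrame contains true_positive, false_positive and false_negative columns (visible in the output above), it can be filtered to focus on detector mistakes. A minimal sketch reusing the filth_list from the example above (here the result is empty, since the only row is a true positive):

>>> df = scrubadub.comparison.get_filth_dataframe(filth_list)
>>> mistakes = df[df['false_positive'] | df['false_negative']]  # rows where a detector went wrong
>>> mistakes.empty
True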
- Parameters
filth_list (A list of Filth objects) – The list of detected filth
- Returns
A pd.DataFrame containing information about the detected Filth
- Return type
pd.DataFrame
- scrubadub.comparison.make_fake_document(paragraphs: int = 20, locale: str = 'en_US', seed: Optional[int] = None, faker: Optional[faker.proxy.Faker] = None, filth_types: Optional[List[str]] = None, fake_text_function: Optional[Callable[[...], str]] = None, additional_filth_types: Optional[Iterable[Type[scrubadub.filth.base.Filth]]] = None) → Tuple[str, List[scrubadub.detectors.tagged.KnownFilthItem]] [source]
Creates a fake document containing Filth that needs to be removed. Also returns the list of known filth items that are needed by the TaggedEvaluationFilthDetector.
An example of using this is shown below:
>>> import scrubadub, scrubadub.comparison
>>> document, known_filth_items = scrubadub.comparison.make_fake_document(paragraphs=1, seed=1)
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_detector(scrubadub.detectors.TaggedEvaluationFilthDetector(
...     known_filth_items=known_filth_items
... ))
>>> filth_list = list(scrubber.iter_filth(document))
>>> print(scrubadub.comparison.get_filth_classification_report(filth_list))
filth detector locale      precision    recall  f1-score   support

email    email  en_US           1.00      1.00      1.00         2
  url      url  en_US           1.00      1.00      1.00         1

            micro avg           1.00      1.00      1.00         3
            macro avg           1.00      1.00      1.00         3
         weighted avg           1.00      1.00      1.00         3
          samples avg           1.00      1.00      1.00         3
- Parameters
paragraphs (int) – The number of paragraphs to generate in the document
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”
seed (int, optional) – The random seed used to generate the document
faker (Faker, optional) – A Faker object that is used to generate the text
filth_types (List[str]) – A list of the Filth.type values to generate
fake_text_function (Callable, optional) – A function that generates one to three sentences of text
- Returns
The document and a list of KnownFilthItems
- Return type
Tuple[str, List[KnownFilthItem]]