scrubadub.detectors¶
scrubadub consists of several Detectors, which are responsible for identifying and iterating over the Filth that can be found in a piece of text.
Base classes¶
Every Detector inherits from scrubadub.detectors.Detector.
scrubadub.detectors.Detector¶
- class scrubadub.detectors.Detector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
object
This is the base class for all detectors.
A simple example of how to make a new detector is given below:
>>> import scrubadub
>>> class MyFilth(scrubadub.filth.Filth):
...     type = 'mine'
>>> class MyDetector(scrubadub.detectors.Detector):
...     name = 'my_fr_detector'
...     def iter_filth(self, text, document_name=None):
...         # This detector always returns this same Filth no matter the input.
...         # You should implement something better here.
...         yield MyFilth(beg=0, end=8, text='My stuff', document_name=document_name,
...                       detector_name=self.name)
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_detector(MyDetector)
>>> text = "My stuff can be found there."
>>> scrubber.clean(text)
'{{MINE}} can be found there.'
You can also advertise a Detector as supporting a certain locale by defining the Detector.supported_locale() function.
- filth_cls¶
alias of
scrubadub.filth.base.Filth
- autoload: bool = False¶
- __init__(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Initialise the Detector.
- Parameters
name (str, optional) – Overrides the default name of the Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- name: str = 'detector'¶
- static locale_transform(locale: str) str ¶
Normalise the locale string, e.g. ‘fr’ -> ‘fr_FR’.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
The normalised locale string
- Return type
str
- static locale_split(locale: str) Tuple[Optional[str], Optional[str]] ¶
Split the locale string into the language and region.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
The two-letter language code and the two-letter region code in a tuple.
- Return type
tuple, (str, str)
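As a concrete illustration of the two helpers above, the following sketch re-implements the splitting and normalisation logic in plain Python. It is an approximation of the documented behaviour, not the library's actual code:

```python
import re
from typing import Optional, Tuple


def locale_split(locale: str) -> Tuple[str, Optional[str]]:
    """Split a locale like "en_GB" into ("en", "GB").

    A bare language code such as "fr" yields ("fr", None).
    Illustrative re-implementation, not scrubadub's own code.
    """
    match = re.fullmatch(r'([a-zA-Z]{2,3})(?:[_-]([a-zA-Z]{2}))?', locale.strip())
    if match is None:
        raise ValueError(f"Unparsable locale: {locale!r}")
    language, region = match.groups()
    return language.lower(), region.upper() if region else None


def locale_transform(locale: str) -> str:
    """Normalise the locale string, e.g. 'fr' -> 'fr_FR'."""
    language, region = locale_split(locale)
    if region is None:
        # Fall back to using the language code as the region, as in 'fr' -> 'fr_FR'
        region = language.upper()
    return f"{language}_{region}"
```

For example, `locale_split("en_GB")` gives `("en", "GB")` and `locale_transform("fr")` gives `"fr_FR"`.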
- iter_filth(text: str, document_name: Optional[str] = None) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
- iter_filth_documents(document_list: Sequence[str], document_names: Sequence[Optional[str]]) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in a list of documents.
- Parameters
document_list (List[str]) – A list of documents to clean.
document_names (List[str]) – A list containing the name of each document.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
scrubadub.detectors.RegexDetector¶
For convenience, there is also a RegexDetector, which makes it easy to quickly add new types of Filth that can be identified from regular expressions:
- class scrubadub.detectors.RegexDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
scrubadub.detectors.base.Detector
Base class to match PII with a regex.
This class requires that the filth_cls attribute be set to the class of the Filth that should be returned by this Detector.

>>> import re, scrubadub
>>> class NewUrlDetector(scrubadub.detectors.RegexDetector):
...     name = 'new_url_detector'
...     filth_cls = scrubadub.filth.url.UrlFilth
...     regex = re.compile(r'https.*$', re.IGNORECASE)
>>> scrubber = scrubadub.Scrubber(detector_list=[NewUrlDetector()])
>>> text = u"This url will be found https://example.com"
>>> scrubber.clean(text)
'This url will be found {{URL}}'
- regex: Optional[Pattern[str]] = None¶
- filth_cls¶
alias of
scrubadub.filth.base.Filth
- iter_filth(text: str, document_name: Optional[str] = None) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
scrubadub.detectors.RegionLocalisedRegexDetector¶
- class scrubadub.detectors.RegionLocalisedRegexDetector(**kwargs)[source]¶
Bases:
scrubadub.detectors.base.RegexDetector
Detector to detect Filth using regular expressions localised by region.
- region_regex: Dict[str, Pattern] = {}¶
- __init__(**kwargs)[source]¶
Initialise the Detector.
- Parameters
name (str, optional) – Overrides the default name of the Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- classmethod supported_locale(locale: str) bool [source]¶
Returns True if this Detector supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
True if the locale is supported, otherwise False
- Return type
bool
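The region dispatch described above can be sketched as follows. The class name and the example pattern are hypothetical stand-ins for a real region_regex table, not the library's actual implementation:

```python
import re
from typing import Dict, List, Optional, Pattern


class RegionRegexSketch:
    """Illustrates dispatching on the region part of a locale."""

    region_regex: Dict[str, Pattern] = {
        # Hypothetical pattern loosely resembling a GB driving licence number
        'GB': re.compile(r'[A-Z9]{5}\d{6}[A-Z9]{2}\d[A-Z]{2}', re.IGNORECASE),
    }

    def __init__(self, locale: str = 'en_GB'):
        # Pick the regex for this locale's region, if one exists
        _, _, region = locale.partition('_')
        self.regex: Optional[Pattern] = self.region_regex.get(region)

    def supported_locale(self, locale: str) -> bool:
        _, _, region = locale.partition('_')
        return region in self.region_regex

    def find(self, text: str) -> List[str]:
        if self.regex is None:
            return []
        return [m.group() for m in self.regex.finditer(text)]
```

A detector constructed with an unsupported locale simply finds nothing, which mirrors how region-localised detectors only activate for regions they have a pattern for.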
Detectors enabled by default¶
These are the detectors that are enabled in the scrubber by default.
scrubadub.detectors.CredentialDetector¶
- class scrubadub.detectors.CredentialDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
scrubadub.detectors.base.RegexDetector
Remove username/password combinations from dirty dirty text.
- regex: Optional[Pattern[str]]¶
Compiled regular expression object.
- filth_cls¶
alias of
scrubadub.filth.credential.CredentialFilth
- name: str = 'credential'¶
- autoload: bool = True¶
scrubadub.detectors.CreditCardDetector¶
- class scrubadub.detectors.CreditCardDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
scrubadub.detectors.base.RegexDetector
Remove credit-card numbers from dirty dirty text.
Supports Visa, MasterCard, American Express, Diners Club and JCB.
- regex: Optional[Pattern[str]]¶
Compiled regular expression object.
- name: str = 'credit_card'¶
- filth_cls¶
alias of
scrubadub.filth.credit_card.CreditCardFilth
- autoload: bool = True¶
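To illustrate the regex-based approach, here is a minimal sketch covering only the Visa and MasterCard number shapes. This is an assumed, simplified pattern for demonstration, not the library's actual regex, which also handles American Express, Diners Club and JCB:

```python
import re
from typing import List

# Illustrative pattern: Visa (13 or 16 digits, starts with 4) and
# MasterCard (16 digits, starts 51-55) only.
CARD_REGEX = re.compile(r"""
    \b(?:
        4[0-9]{12}(?:[0-9]{3})?    # Visa
        |5[1-5][0-9]{14}           # MasterCard
    )\b
""", re.VERBOSE)


def find_card_numbers(text: str) -> List[str]:
    """Return all card-like numbers found in the text."""
    return [m.group() for m in CARD_REGEX.finditer(text)]
```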
scrubadub.detectors.DriversLicenceDetector¶
- class scrubadub.detectors.DriversLicenceDetector(**kwargs)[source]¶
Bases:
scrubadub.detectors.base.RegionLocalisedRegexDetector
Use regular expressions to detect UK driving licence numbers. This is simple pattern matching, with no checksum validation.
- region_regex: Dict[str, Pattern]¶
- name: str = 'drivers_licence'¶
- autoload: bool = True¶
- filth_cls¶
alias of
scrubadub.filth.drivers_licence.DriversLicenceFilth
scrubadub.detectors.EmailDetector¶
- class scrubadub.detectors.EmailDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
scrubadub.detectors.base.RegexDetector
Use regular expression magic to remove email addresses from dirty dirty text. This method also catches email addresses like john at gmail.com.
- regex: Optional[Pattern[str]]¶
Compiled regular expression object.
- filth_cls¶
alias of
scrubadub.filth.email.EmailFilth
- name: str = 'email'¶
- autoload: bool = True¶
- at_matcher = re.compile('@|\\sat\\s', re.IGNORECASE)¶
- dot_matcher = re.compile('\\.|\\sdot\\s', re.IGNORECASE)¶
- iter_filth(text: str, document_name: Optional[str] = None) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
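The at_matcher and dot_matcher patterns above suggest how obfuscated addresses such as john at gmail dot com can be caught. A rough sketch of that precondition check follows; it is an illustration of the idea, not the detector's real logic:

```python
import re

# Patterns mirroring the documented at_matcher / dot_matcher idea:
# an '@' or the word 'at', a '.' or the word 'dot'.
AT = re.compile(r'@|\sat\s', re.IGNORECASE)
DOT = re.compile(r'\.|\sdot\s', re.IGNORECASE)


def looks_like_obfuscated_email(text: str) -> bool:
    """True if the text contains both an '@'-like and a '.'-like token,
    a rough precondition before running a full email regex."""
    return bool(AT.search(text)) and bool(DOT.search(text))
```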
scrubadub.detectors.en_GB.NationalInsuranceNumberDetector¶
- class scrubadub.detectors.en_GB.NationalInsuranceNumberDetector(**kwargs)[source]¶
Bases:
scrubadub.detectors.base.RegionLocalisedRegexDetector
Use regular expressions to remove the GB National Insurance number (NINO). This is simple pattern matching, with no checksum validation.
- region_regex: Dict[str, Pattern]¶
- name: str = 'national_insurance_number'¶
- autoload: bool = True¶
- filth_cls¶
alias of
scrubadub.filth.en_GB.national_insurance_number.NationalInsuranceNumberFilth
scrubadub.detectors.PhoneDetector¶
- class scrubadub.detectors.PhoneDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
scrubadub.detectors.base.Detector
Remove phone numbers from dirty dirty text using python-phonenumbers, a port of a Google project to correctly format phone numbers in text.
Set the locale on the scrubber or detector to set the region used to search for valid phone numbers. If the locale is set to ‘en_CA’, Canadian numbers will be searched for, while setting the locale to ‘en_GB’ searches for British numbers.
- filth_cls¶
alias of
scrubadub.filth.phone.PhoneFilth
- name: str = 'phone'¶
- autoload: bool = True¶
- iter_filth(text, document_name: Optional[str] = None)[source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
- classmethod supported_locale(locale: str) bool [source]¶
Returns True if this Detector supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
True if the locale is supported, otherwise False
- Return type
bool
scrubadub.detectors.PostalCodeDetector¶
- class scrubadub.detectors.PostalCodeDetector(**kwargs)[source]¶
Bases:
scrubadub.detectors.base.RegionLocalisedRegexDetector
Detects postal codes; currently only British post codes are supported.
- region_regex: Dict[str, Pattern]¶
- filth_cls¶
alias of
scrubadub.filth.postalcode.PostalCodeFilth
- name: str = 'postalcode'¶
- autoload: bool = True¶
scrubadub.detectors.en_GB.TaxReferenceNumberDetector¶
- class scrubadub.detectors.en_GB.TaxReferenceNumberDetector(**kwargs)[source]¶
Bases:
scrubadub.detectors.base.RegionLocalisedRegexDetector
Use regular expressions to detect the UK PAYE temporary reference number (TRN). This is simple pattern matching, with no checksum validation.
- region_regex: Dict[str, Pattern]¶
- name: str = 'tax_reference_number'¶
- autoload: bool = True¶
- filth_cls¶
alias of
scrubadub.filth.en_GB.tax_reference_number.TaxReferenceNumberFilth
scrubadub.detectors.TwitterDetector¶
- class scrubadub.detectors.TwitterDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
scrubadub.detectors.base.RegexDetector
Use regular expression magic to remove Twitter usernames from dirty dirty text.
- regex: Optional[Pattern[str]]¶
Compiled regular expression object.
- filth_cls¶
alias of
scrubadub.filth.twitter.TwitterFilth
- name: str = 'twitter'¶
- autoload: bool = True¶
scrubadub.detectors.UrlDetector¶
- class scrubadub.detectors.UrlDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
scrubadub.detectors.base.RegexDetector
Use regular expressions to remove URLs that begin with http://, https:// or www. from dirty dirty text.
With keep_domain=True, this detector only obfuscates the path of a URL, not its domain. For example, http://twitter.com/someone/status/234978haoin becomes http://twitter.com/{{replacement}}.
- regex: Optional[Pattern[str]]¶
Compiled regular expression object.
- filth_cls¶
alias of
scrubadub.filth.url.UrlFilth
- name: str = 'url'¶
- autoload: bool = True¶
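The keep_domain behaviour described above can be sketched with the standard library. This illustrates the idea of keeping the scheme and domain while replacing the path; it is not the detector's actual implementation:

```python
from urllib.parse import urlsplit, urlunsplit


def obfuscate_url_path(url: str, replacement: str = '{{replacement}}') -> str:
    """Keep the scheme and domain of a URL, replacing its path."""
    parts = urlsplit(url)
    if not parts.path or parts.path == '/':
        # Nothing to obfuscate: the URL is just a domain
        return url
    # Drop query and fragment along with the path
    return urlunsplit((parts.scheme, parts.netloc, '/' + replacement, '', ''))
```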
scrubadub.detectors.VehicleLicencePlateDetector¶
- class scrubadub.detectors.VehicleLicencePlateDetector(**kwargs)[source]¶
Bases:
scrubadub.detectors.base.RegionLocalisedRegexDetector
Detects standard British licence plates.
- region_regex: Dict[str, Pattern]¶
- filth_cls¶
alias of
scrubadub.filth.vehicle_licence_plate.VehicleLicencePlateFilth
- name: str = 'vehicle_licence_plate'¶
- autoload: bool = True¶
Optional detectors¶
These detectors need to be manually added to a Scrubber; they are not loaded automatically.
An example is shown below that demonstrates the various ways that a detector can be added to a Scrubber:
>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(detector_list=[scrubadub.detectors.TextBlobNameDetector()])
>>> scrubber.add_detector(scrubadub.detectors.CreditCardDetector)
>>> scrubber.add_detector('skype')
>>> detector = scrubadub.detectors.DateOfBirthDetector(require_context=True)
>>> scrubber.add_detector(detector)
For further information see the usage page.
scrubadub.detectors.DateOfBirthDetector¶
- class scrubadub.detectors.DateOfBirthDetector(context_before: int = 2, context_after: int = 1, require_context: bool = True, context_words: Optional[List[str]] = None, **kwargs)[source]¶
Bases:
scrubadub.detectors.base.Detector
This detector aims to detect dates of birth in text.
First all possible dates are found, then they are filtered to those that would result in people being between DateOfBirthFilth.min_age_years and DateOfBirthFilth.max_age_years, which default to 18 and 100 respectively.
If require_context is True, we search for one of the possible context_words near the found date. We search up to context_before lines before the date and up to context_after lines after the date. The context words we search for are terms like ‘birth’ or ‘DoB’, which increase the likelihood that the date is indeed a date of birth. The context words can be set using the context_words parameter, which expects a list of strings.

>>> import scrubadub, scrubadub.detectors.date_of_birth
>>> DateOfBirthFilth.min_age_years = 12
>>> scrubber = scrubadub.Scrubber(detector_list=[
...     scrubadub.detectors.date_of_birth.DateOfBirthDetector(),
... ])
>>> scrubber.clean("I was born on 10-Nov-2008.")
'I was born {{DATE_OF_BIRTH}}.'
- name: str = 'date_of_birth'¶
- filth_cls¶
alias of
scrubadub.filth.date_of_birth.DateOfBirthFilth
- autoload: bool = False¶
- context_words_language_map = {'de': ['geburt', 'geboren', 'geb', 'geb.'], 'en': ['birth', 'born', 'dob', 'd.o.b.']}¶
- __init__(context_before: int = 2, context_after: int = 1, require_context: bool = True, context_words: Optional[List[str]] = None, **kwargs)[source]¶
Initialise the detector.
- Parameters
context_before (int) – The number of lines of context to search before the date
context_after (int) – The number of lines of context to search after the date
require_context (bool) – Set to False if your dates of birth are not near words that provide context (such as “birth” or “DOB”).
context_words (bool) – A list of words that provide context related to dates of birth, such as the following: ‘birth’, ‘born’, ‘dob’ or ‘d.o.b.’.
name (str, optional) – Overrides the default name of the Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- iter_filth(text: str, document_name: Optional[str] = None) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Search text for Filth and return a generator of Filth objects.
- Parameters
text (str) – The dirty text that this Detector should search
document_name (Optional[str]) – Name of the document this is being passed to this detector
- Returns
The found Filth in the text
- Return type
Generator[Filth]
- classmethod supported_locale(locale: str) bool [source]¶
Returns True if this Detector supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code eg “en”, “es”.
- Returns
True if the locale is supported, otherwise False
- Return type
bool
scrubadub.detectors.SkypeDetector¶
- class scrubadub.detectors.SkypeDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
scrubadub.detectors.base.RegexDetector
Skype usernames tend to be used inline in dirty dirty text quite often, but also appear as skype: {{SKYPE}} quite a bit. This method looks at words within word_radius words of “skype” for things that appear to be misspelled or have punctuation in them, as a means to identify Skype usernames.
The default word_radius is 10, corresponding to the rough scale of half of a sentence before or after the word “skype” is used. Increasing word_radius will increase the false positive rate, and decreasing word_radius will increase the false negative rate.
- filth_cls¶
alias of
scrubadub.filth.skype.SkypeFilth
- name: str = 'skype'¶
- autoload: bool = False¶
- word_radius = 10¶
- SKYPE_TOKEN = '[a-zA-Z][a-zA-Z0-9_\\-\\,\\.]+'¶
- SKYPE_USERNAME = re.compile('[a-zA-Z][a-zA-Z0-9_\\-\\,\\.]{5,31}')¶
- iter_filth(text, document_name: Optional[str] = None) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
scrubadub.detectors.TaggedEvaluationFilthDetector¶
- class scrubadub.detectors.TaggedEvaluationFilthDetector(known_filth_items: List[scrubadub.detectors.tagged.KnownFilthItem], **kwargs)[source]¶
Bases:
scrubadub.detectors.base.Detector
Use this Detector to find tagged filth and treat it as true Filth. This is useful when you want to evaluate the effectiveness of a Detector using Filth that has been selected by a human.
Results from this detector are used as the “truth” against which the other detectors are compared. This is done in scrubadub.comparison.get_filth_classification_report, where the detection accuracies are calculated.
An example of how to use this detector is given below:

>>> import scrubadub, scrubadub.comparison, scrubadub.detectors.text_blob
>>> scrubber = scrubadub.Scrubber(detector_list=[
...     scrubadub.detectors.text_blob.TextBlobNameDetector(name='name_detector'),
...     scrubadub.detectors.TaggedEvaluationFilthDetector([
...         {'match': 'Tom', 'filth_type': 'name'},
...         {'match': 'tom@example.com', 'filth_type': 'email'},
...     ]),
... ])
>>> filth_list = list(scrubber.iter_filth("Hello I am Tom"))
>>> print(scrubadub.comparison.get_filth_classification_report(filth_list))
filth    detector       locale    precision    recall  f1-score   support
name     name_detector  en_US          1.00      1.00      1.00         1
accuracy                                                   1.00         1
macro avg                              1.00      1.00      1.00         1
weighted avg                           1.00      1.00      1.00         1
This detector takes a list of dictionaries (referred to as known filth items). These specify what to look for in the text to label as tagged filth. The dictionary should contain the following keys:
match (str) – a string value that will be searched for in the text
filth_type (str) – a string value that indicates the type of Filth; it should be set to the Filth.name. Examples of these could be ‘name’ or ‘phone’ for name and phone filths respectively.
The known filth item dictionary may also optionally contain:
match_end (str) – if specified, will search for Filth starting with the value of match and ending with the value of match_end
limit (int) – an integer describing the maximum number of characters between match and match_end, defaults to 150
ignore_case (bool) – ignore case when searching for the tagged filth
ignore_whitespace (bool) – ignore whitespace when matching (“asd qwe” can also match “asd\nqwe”)
ignore_partial_word_matches (bool) – ignore matches that are only partial words (if you’re looking for “Eve”, this flag ensures it won’t match “Evening”)
Examples of this:
{'match': 'aaa', 'filth_type': 'name'} – will search for an exact match to aaa and return it as a NameFilth
{'match': 'aaa', 'match_end': 'zzz', 'filth_type': 'name'} – will search for aaa followed by up to 150 characters followed by zzz, which would match both aaabbbzzz and aaazzz
{'match': '012345', 'filth_type': 'phone', 'ignore_partial_word_matches': True} – will search for an exact match to 012345, ignoring any partial matches, and return it as a PhoneFilth
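A sketch of how the optional flags above can translate into a regular expression. This illustrates the documented matching semantics (ignore_case, ignore_whitespace, match_end with a limit), not the library's implementation:

```python
import re
from typing import Pattern


def build_pattern(item: dict) -> Pattern:
    """Build a regex from a known filth item dict."""
    flags = re.IGNORECASE if item.get('ignore_case') else 0

    def prep(s: str) -> str:
        if item.get('ignore_whitespace'):
            # Any run of whitespace in the match string matches any whitespace
            return r'\s+'.join(re.escape(part) for part in s.split())
        return re.escape(s)

    pattern = prep(item['match'])
    if 'match_end' in item:
        # Allow up to `limit` characters (default 150) between match and match_end
        limit = item.get('limit', 150)
        pattern += '.{0,%d}?' % limit + prep(item['match_end'])
    return re.compile(pattern, flags)
```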
This detector is not enabled by default (since you need to supply a list of known filths), so you must always add it to your scrubber with a scrubber.add_detector(detector) call or by adding it to the detector_list when initialising a Scrubber.
- filth_cls¶
alias of
scrubadub.filth.tagged.TaggedEvaluationFilth
- name: str = 'tagged'¶
- autoload: bool = False¶
- __init__(known_filth_items: List[scrubadub.detectors.tagged.KnownFilthItem], **kwargs)[source]¶
Initialise the Detector.
- Parameters
known_filth_items (list of dicts) – A list of dictionaries that describe items to be searched for in the dirty text. The keys match and filth_type are required, which give the text to be searched for and the type of filth that the match string represents. See the class docstring for further details of available flags in this dictionary.
tagged_filth (bool, default True) – Whether the filth has been tagged and should be used as truth when calculating filth finding accuracies.
name (str, optional) – Overrides the default name of the Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- static dedup_dicts(known_filth_items: List[scrubadub.detectors.tagged.KnownFilthItem]) List[scrubadub.detectors.tagged.KnownFilthItem] [source]¶
- create_filth(start_location: int, end_location: int, text: str, comparison_type: Optional[str], detector_name: str, document_name: Optional[str], locale: str) scrubadub.filth.base.Filth [source]¶
- iter_filth(text: str, document_name: Optional[str] = None) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
scrubadub.detectors.TextBlobNameDetector¶
- class scrubadub.detectors.TextBlobNameDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
scrubadub.detectors.base.RegexDetector
Use part of speech tagging from textblob to clean proper nouns out of the dirty dirty text. Disallow particular nouns by adding them to the NameDetector.disallowed_nouns set.
- filth_cls¶
alias of
scrubadub.filth.name.NameFilth
- name: str = 'text_blob_name'¶
- autoload: bool = False¶
- disallowed_nouns = {'skype'}¶
- iter_filth(text, document_name: Optional[str] = None) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
- classmethod supported_locale(locale: str) bool [source]¶
Returns True if this Detector supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
True if the locale is supported, otherwise False
- Return type
bool
scrubadub.detectors.UserSuppliedFilthDetector¶
- class scrubadub.detectors.UserSuppliedFilthDetector(known_filth_items: List[scrubadub.detectors.tagged.KnownFilthItem], **kwargs)[source]¶
Bases:
scrubadub.detectors.tagged.TaggedEvaluationFilthDetector
Use this Detector to find some known filth in the text. An example might be if you have a list of employee numbers that you wish to remove from a document, as shown below:

>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(detector_list=[
...     scrubadub.detectors.UserSuppliedFilthDetector([
...         {'match': 'Anika', 'filth_type': 'name'},
...         {'match': 'Larry', 'filth_type': 'name'},
...     ]),
... ])
>>> scrubber.clean("Anika is my favourite employee.")
'{{NAME}} is my favourite employee.'
This detector takes a list of dictionaries (referred to as known filth items). These specify what to look for in the text to label as tagged filth. The dictionary should contain the following keys:
match (str) – a string value that will be searched for in the text
filth_type (str) – a string value that indicates the type of Filth; it should be set to the Filth.name. Examples of these could be ‘name’ or ‘phone’ for name and phone filths respectively.
The known filth item dictionary may also optionally contain:
match_end (str) – if specified, will search for Filth starting with the value of match and ending with the value of match_end
limit (int) – an integer describing the maximum number of characters between match and match_end, defaults to 150
ignore_case (bool) – ignore case when searching for the tagged filth
ignore_whitespace (bool) – ignore whitespace when matching (“asd qwe” can also match “asd\nqwe”)
ignore_partial_word_matches (bool) – ignore matches that are only partial words (if you’re looking for “Eve”, this flag ensures it won’t match “Evening”)
Examples of this:
{'match': 'aaa', 'filth_type': 'name'} – will search for an exact match to aaa and return it as a NameFilth
{'match': 'aaa', 'match_end': 'zzz', 'filth_type': 'name'} – will search for aaa followed by up to 150 characters followed by zzz, which would match both aaabbbzzz and aaazzz
{'match': '012345', 'filth_type': 'phone', 'ignore_partial_word_matches': True} – will search for an exact match to 012345, ignoring any partial matches, and return it as a PhoneFilth
This detector is not enabled by default (since you need to supply a list of known filths), so you must always add it to your scrubber with a scrubber.add_detector(detector) call or by adding it to the detector_list when initialising a Scrubber.
- name: str = 'user_supplied'¶
- create_filth(start_location: int, end_location: int, text: str, comparison_type: Optional[str], detector_name: str, document_name: Optional[str], locale: str) scrubadub.filth.base.Filth [source]¶
External detectors¶
These are detectors that are not included in the scrubadub package, usually because they come with large external dependencies that are not always needed.
To use them, first import their package and then add them to the Scrubber; an example of this is shown below:
>>> import scrubadub, scrubadub_address
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_detector(scrubadub_address.detectors.AddressDetector)
scrubadub_address.detectors.AddressDetector¶
scrubadub_spacy.detectors.SpacyEntityDetector¶
- class scrubadub_spacy.detectors.SpacyEntityDetector(named_entities: Optional[Iterable[str]] = None, model: Optional[str] = None, **kwargs)[source]¶
Bases:
scrubadub.detectors.base.Detector
Use spaCy’s named entity recognition to identify possible Filth.
This detector is made to work with v3 of spaCy, since the NER model has been significantly improved in this version.
This is particularly useful to remove names from text, but can also be used to remove any entity that is recognised by spaCy. A full list of entities that spacy supports can be found here: https://spacy.io/api/annotation#named-entities.
Additional entities can be added like so:
>>> import scrubadub, scrubadub_spacy
>>> class MoneyFilth(scrubadub.filth.Filth):
...     type = 'money'
>>> scrubadub_spacy.detectors.spacy.SpacyEntityDetector.filth_cls_map['MONEY'] = MoneyFilth
>>> detector = scrubadub_spacy.detectors.spacy.SpacyEntityDetector(named_entities=['MONEY'])
>>> scrubber = scrubadub.Scrubber(detector_list=[detector])
>>> scrubber.clean("You owe me 12 dollars man!")
'You owe me {{MONEY}} man!'
The dictionary scrubadub_spacy.detectors.spacy.SpacyEntityDetector.filth_cls_map is used to map between the spaCy named entity label and the type of scrubadub Filth, while the named_entities argument sets which named entities are considered Filth by the SpacyEntityDetector.
.- filth_cls_map = {'DATE': <class 'scrubadub.filth.date_of_birth.DateOfBirthFilth'>, 'FAC': <class 'scrubadub.filth.location.LocationFilth'>, 'GPE': <class 'scrubadub.filth.location.LocationFilth'>, 'LOC': <class 'scrubadub.filth.location.LocationFilth'>, 'ORG': <class 'scrubadub.filth.organization.OrganizationFilth'>, 'PER': <class 'scrubadub.filth.name.NameFilth'>, 'PERSON': <class 'scrubadub.filth.name.NameFilth'>}¶
- name: str = 'spacy'¶
- language_to_model = {'de': 'de_dep_news_trf', 'en': 'en_core_web_trf', 'es': 'es_dep_news_trf', 'fr': 'fr_dep_news_trf', 'nl': 'nl_core_news_trf', 'zh': 'zh_core_web_trf'}¶
- disallowed_nouns = {'skype'}¶
- __init__(named_entities: Optional[Iterable[str]] = None, model: Optional[str] = None, **kwargs)[source]¶
Initialise the Detector.
- Parameters
named_entities (Iterable[str], optional) – Limit the named entities to those in this list, defaults to {'PERSON', 'PER', 'ORG'}
model (str, optional) – The name of the spacy model to use; it must contain a ‘ner’ step in the model pipeline (most do, but not all).
name (str, optional) – Overrides the default name of the Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- iter_filth_documents(document_list: Sequence[str], document_names: Sequence[Optional[str]]) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in a list of documents.
- Parameters
document_list (List[str]) – A list of documents to clean.
document_names (List[str]) – A list containing the name of each document.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
- iter_filth(text: str, document_name: Optional[str] = None) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
- classmethod supported_locale(locale: str) bool [source]¶
Returns True if this Detector supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
True if the locale is supported, otherwise False
- Return type
bool
scrubadub_spacy.detectors.SpacyNameDetector¶
- class scrubadub_spacy.detectors.SpacyNameDetector(include_spacy: bool = True, **kwargs)[source]¶
Bases:
scrubadub_spacy.detectors.spacy.SpacyEntityDetector
Add an extension to the spaCy detector to look for tokens that often occur before or after people’s names: a prefix might be Hello, as in “Hello Jane”, or Mrs, as in “Mrs Jane Smith”, and a suffix could be PhD, as in “Jane Smith PhD”.
See the SpacyDetector for further info on how to use this detector, as it shares many similar options.
Currently only English prefixes and suffixes are supported, but titles in other languages can easily be added, as in the example below:

>>> import scrubadub, scrubadub_spacy
>>> scrubadub_spacy.detectors.spacy_name_title.SpacyNameDetector.NOUN_TAGS['de'] = ['NN', 'NE', 'NNE']
>>> scrubadub_spacy.detectors.spacy_name_title.SpacyNameDetector.NAME_PREFIXES['de'] = ['frau', 'herr']
>>> detector = scrubadub_spacy.detectors.spacy_name_title.SpacyNameDetector(locale='de_DE',
...     model='de_core_news_sm', include_spacy=False)
>>> scrubber = scrubadub.Scrubber(detector_list=[detector], locale='de_DE')
>>> scrubber.clean("bleib dort Frau Schmidt")
'bleib dort {{NAME+NAME}}'
- name: str = 'spacy_name'¶
- NAME_PREFIXES = {'en': ['mr', 'mr.', 'mister', 'mrs', 'mrs.', 'misses', 'ms', 'ms.', 'miss', 'dr', 'dr.', 'doctor', 'prof', 'prof.', 'professor', 'lord', 'lady', 'rev', 'rev.', 'reverend', 'hon', 'hon.', 'honourable', 'hhj', 'honorable', 'judge', 'sir', 'madam', 'hello', 'dear', 'hi', 'hey', 'regards', 'to:', 'from:', 'sender:']}¶
- NAME_SUFFIXES = {'en': ['phd', 'bsc', 'msci', 'ba', 'md', 'qc', 'ma', 'mba']}¶
- NOUN_TAGS = {'en': ['NNP', 'NN', 'NNPS']}¶
- TOKEN_SEARCH_DISTANCE = 3¶
- MINIMUM_NAME_LENGTH = 1¶
- __init__(include_spacy: bool = True, **kwargs)[source]¶
Initialise the
Detector
.
- Parameters
include_spacy (bool, default True) – Include the default spacy named-entity detection in addition to the title-based detection.
named_entities (Iterable[str], optional) – Limit the named entities to those in this list, defaults to
{'PERSON', 'PER', 'ORG'}.
model (str, optional) – The name of the spacy model to use; it must contain a ‘ner’ step in its pipeline (most do, but not all).
name (str, optional) – Overrides the default name of the
Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- static find_names(doc: spacy.tokens.doc.Doc, tokens: Sequence[spacy.tokens.token.Token], noun_tags: List[str]) spacy.tokens.doc.Doc [source]¶
This function searches for possible names in a flagged set of tokens and adds them to the identified entities.
- iter_filth_documents(document_list: Sequence[str], document_names: Sequence[Optional[str]]) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in a list of documents.
- Parameters
document_list (List[str]) – A list of documents to clean.
document_names (List[str]) – A list containing the name of each document.
- Returns
A generator of the discovered
Filth
- Return type
Generator[Filth, None, None]
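The relationship between iter_filth_documents and per-document filth can be sketched in a simplified, standalone form. The SimpleFilth class and SketchDetector below are illustrative only (not the spacy implementation); the Filth fields mirror those used in the base-class example at the top of this page:

```python
from dataclasses import dataclass
from typing import Generator, Optional, Sequence


@dataclass
class SimpleFilth:
    # Mirrors the Filth fields used in the base-class example above.
    beg: int
    end: int
    text: str
    document_name: Optional[str]
    detector_name: str


class SketchDetector:
    name = 'sketch'
    target = 'Jane'  # illustrative: flag every literal occurrence of this token

    def iter_filth_documents(
        self,
        document_list: Sequence[str],
        document_names: Sequence[Optional[str]],
    ) -> Generator[SimpleFilth, None, None]:
        # Pair each document with its name and yield the filth found in each.
        for text, doc_name in zip(document_list, document_names):
            start = text.find(self.target)
            while start != -1:
                yield SimpleFilth(
                    beg=start,
                    end=start + len(self.target),
                    text=self.target,
                    document_name=doc_name,
                    detector_name=self.name,
                )
                start = text.find(self.target, start + 1)


detector = SketchDetector()
filths = list(detector.iter_filth_documents(
    ['Jane is here.', 'No names.'], ['a.txt', 'b.txt'],
))
print([(f.document_name, f.beg, f.end) for f in filths])  # [('a.txt', 0, 4)]
```

Processing all documents in one call lets a detector batch its work (e.g. run a spacy pipeline once over the whole list) rather than invoking the model per document.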
- classmethod supported_locale(locale: str) bool [source]¶
Returns true if this
Detector
supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
True
if the locale is supported, otherwise False
- Return type
bool
scrubadub_stanford.detectors.StanfordEntityDetector¶
- class scrubadub_stanford.detectors.StanfordEntityDetector(enable_person: bool = True, enable_organization: bool = True, enable_location: bool = False, **kwargs)[source]¶
Bases:
scrubadub.detectors.base.Detector
Search for people’s names, organizations’ names, and locations within text using the Stanford 3-class model.
The three classes of this model can be enabled with the three initialiser arguments enable_person, enable_organization and enable_location. An example of their usage is given below.
>>> import scrubadub, scrubadub_stanford
>>> detector = scrubadub_stanford.detectors.StanfordEntityDetector(
...     enable_person=False, enable_organization=False, enable_location=True
... )
>>> scrubber = scrubadub.Scrubber(detector_list=[detector])
>>> scrubber.clean('Jane is visiting London.')
'Jane is visiting {{LOCATION}}.'
- filth_cls¶
alias of
scrubadub.filth.base.Filth
- name: str = 'stanford'¶
- ignored_words = ['tennant']¶
- stanford_version = '4.0.0'¶
- stanford_download_url = 'https://nlp.stanford.edu/software/stanford-ner-{version}.zip'¶
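The two class attributes above presumably combine to give the model's download location, with stanford_version filling the {version} placeholder in stanford_download_url:

```python
# Reconstructing the download URL from the class attributes listed above.
stanford_version = '4.0.0'
stanford_download_url = 'https://nlp.stanford.edu/software/stanford-ner-{version}.zip'

url = stanford_download_url.format(version=stanford_version)
print(url)  # https://nlp.stanford.edu/software/stanford-ner-4.0.0.zip
```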
- __init__(enable_person: bool = True, enable_organization: bool = True, enable_location: bool = False, **kwargs)[source]¶
Initialise the
Detector
.
- Parameters
enable_person (bool, default True) – Enable detection of people’s names.
enable_organization (bool, default True) – Enable detection of organizations’ names.
enable_location (bool, default False) – Enable detection of locations.
name (str, optional) – Overrides the default name of the
Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- iter_filth(text, document_name: Optional[str] = None)[source]¶
Yields discovered filth in the provided
text
.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered
Filth
- Return type
Iterator[
Filth
]
- classmethod supported_locale(locale: str) bool [source]¶
Returns true if this
Detector
supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
True
if the locale is supported, otherwise False
- Return type
bool
Catalogue functions¶
These functions register or remove Detectors from the Detector catalogue.
scrubadub.detectors.register_detector¶
- scrubadub.detectors.register_detector(detector: Type[Detector], *, autoload: Optional[bool] = None) Type[Detector] [source]¶
Register a detector for use with the
Scrubber
class.You can use
register_detector(NewDetector, autoload=True)
after your detector definition to automatically register it with the
Scrubber
class so that it can be used to remove Filth.
The argument autoload decides whether a new Scrubber() instance should load this detector by default.
>>> import scrubadub
>>> class NewDetector(scrubadub.detectors.Detector):
...     pass
>>> scrubadub.detectors.register_detector(NewDetector, autoload=False)
<class 'scrubadub.detectors.catalogue.NewDetector'>
- Parameters
detector (Detector class) – The
Detector
to register with the scrubadub detector configuration.
autoload (Optional[bool]) – Whether to automatically load this
Detector
on Scrubber
initialisation.
scrubadub.detectors.remove_detector¶
- scrubadub.detectors.remove_detector(detector: Union[Type[Detector], str])[source]¶
Remove an already registered detector.
>>> import scrubadub
>>> class NewDetector(scrubadub.detectors.Detector):
...     pass
>>> scrubadub.detectors.catalogue.register_detector(NewDetector, autoload=False)
<class 'scrubadub.detectors.catalogue.NewDetector'>
>>> scrubadub.detectors.catalogue.remove_detector(NewDetector)
- Parameters
detector (Union[Type[Detector], str]) – The
Detector
(or its registered name) to remove from the scrubadub detector configuration.
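The catalogue pattern behind register_detector and remove_detector can be sketched in a standalone form. The dictionary and class names below are illustrative, not scrubadub internals; the sketch mirrors the documented signatures, including remove_detector accepting either a class or its name:

```python
from typing import Dict, Optional, Type, Union


class Detector:
    name: str = 'detector'
    autoload: bool = False


# Illustrative module-level catalogue mapping detector names to classes.
detector_catalogue: Dict[str, Type[Detector]] = {}


def register_detector(detector: Type[Detector], *,
                      autoload: Optional[bool] = None) -> Type[Detector]:
    if autoload is not None:
        # Decides whether a new Scrubber() instance loads this detector by default.
        detector.autoload = autoload
    detector_catalogue[detector.name] = detector
    return detector


def remove_detector(detector: Union[Type[Detector], str]) -> None:
    # Accept either the class itself or its registered name.
    name = detector if isinstance(detector, str) else detector.name
    detector_catalogue.pop(name, None)


class NewDetector(Detector):
    name = 'new_detector'


register_detector(NewDetector, autoload=False)
print('new_detector' in detector_catalogue)  # True
remove_detector('new_detector')
print('new_detector' in detector_catalogue)  # False
```

Returning the class from register_detector lets it double as a decorator placed directly above a detector definition.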