scrubadub.detectors¶
scrubadub consists of several Detectors, which are responsible for identifying and iterating over the Filth that can be found in a piece of text.
Base classes¶
Every Detector inherits from scrubadub.detectors.Detector.
scrubadub.detectors.Detector¶
- class scrubadub.detectors.Detector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
object
This is the base class for all detectors.
A simple example of how to make a new detector is given below:
>>> import scrubadub
>>> class MyFilth(scrubadub.filth.Filth):
...     type = 'mine'
>>> class MyDetector(scrubadub.detectors.Detector):
...     name = 'my_fr_detector'
...     def iter_filth(self, text, document_name=None):
...         # This detector always returns this same Filth no matter the input.
...         # You should implement something better here.
...         yield MyFilth(beg=0, end=8, text='My stuff', document_name=document_name,
...                       detector_name=self.name)
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_detector(MyDetector)
>>> text = "My stuff can be found there."
>>> scrubber.clean(text)
'{{MINE}} can be found there.'
You can also advertise a Detector as supporting a certain locale by defining the Detector.supported_locale() function.
- filth_cls¶
alias of
scrubadub.filth.base.Filth
- autoload: bool = False¶
- __init__(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Initialise the Detector.
- Parameters
name (str, optional) – Overrides the default name of the Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- name: str = 'detector'¶
- static locale_transform(locale: str) str ¶
Normalise the locale string, e.g. ‘fr’ -> ‘fr_FR’.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
The normalised locale string
- Return type
str
- static locale_split(locale: str) Tuple[Optional[str], Optional[str]] ¶
Split the locale string into the language and region.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
The two-letter language code and the two-letter region code in a tuple.
- Return type
tuple, (str, str)
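As a concrete illustration of the two helpers above, the following sketch re-implements the splitting and normalisation logic in plain Python. It is an approximation of the documented behaviour, not the library's actual code:

```python
import re
from typing import Optional, Tuple


def locale_split(locale: str) -> Tuple[str, Optional[str]]:
    """Split a locale like "en_GB" into ("en", "GB").

    A bare language code such as "fr" yields ("fr", None).
    Illustrative re-implementation, not scrubadub's own code.
    """
    match = re.fullmatch(r'([a-zA-Z]{2,3})(?:[_-]([a-zA-Z]{2}))?', locale.strip())
    if match is None:
        raise ValueError(f"Unparsable locale: {locale!r}")
    language, region = match.groups()
    return language.lower(), region.upper() if region else None


def locale_transform(locale: str) -> str:
    """Normalise the locale string, e.g. 'fr' -> 'fr_FR'."""
    language, region = locale_split(locale)
    if region is None:
        # Fall back to using the language code as the region, as in 'fr' -> 'fr_FR'
        region = language.upper()
    return f"{language}_{region}"
```

For example, `locale_split("en_GB")` gives `("en", "GB")` and `locale_transform("fr")` gives `"fr_FR"`.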
- iter_filth(text: str, document_name: Optional[str] = None) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
- iter_filth_documents(document_list: Sequence[str], document_names: Sequence[Optional[str]]) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in a list of documents.
- Parameters
document_list (List[str]) – A list of documents to clean.
document_names (List[str]) – A list containing the name of each document.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
scrubadub.detectors.RegexDetector¶
For convenience, there is also a RegexDetector, which makes it easy to quickly add new types of Filth that can be identified from regular expressions:
- class scrubadub.detectors.RegexDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
scrubadub.detectors.base.Detector
Base class to match PII with a regex.
This class requires that the filth_cls attribute be set to the class of the Filth that should be returned by this Detector.

>>> import re, scrubadub
>>> class NewUrlDetector(scrubadub.detectors.RegexDetector):
...     name = 'new_url_detector'
...     filth_cls = scrubadub.filth.url.UrlFilth
...     regex = re.compile(r'https.*$', re.IGNORECASE)
>>> scrubber = scrubadub.Scrubber(detector_list=[NewUrlDetector()])
>>> text = u"This url will be found https://example.com"
>>> scrubber.clean(text)
'This url will be found {{URL}}'
- regex: Optional[Pattern[str]] = None¶
- filth_cls¶
alias of
scrubadub.filth.base.Filth
- iter_filth(text: str, document_name: Optional[str] = None) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
scrubadub.detectors.RegionLocalisedRegexDetector¶
- class scrubadub.detectors.RegionLocalisedRegexDetector(**kwargs)[source]¶
Bases:
scrubadub.detectors.base.RegexDetector
Detector to detect Filth using regular expressions localised by region.
- region_regex: Dict[str, Pattern] = {}¶
- __init__(**kwargs)[source]¶
Initialise the Detector.
- Parameters
name (str, optional) – Overrides the default name of the Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- classmethod supported_locale(locale: str) bool [source]¶
Returns True if this Detector supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
True if the locale is supported, otherwise False
- Return type
bool
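The region dispatch described above can be sketched as follows. The class name and the example pattern are hypothetical stand-ins for a real region_regex table, not the library's actual implementation:

```python
import re
from typing import Dict, List, Optional, Pattern


class RegionRegexSketch:
    """Illustrates dispatching on the region part of a locale."""

    region_regex: Dict[str, Pattern] = {
        # Hypothetical pattern loosely resembling a GB driving licence number
        'GB': re.compile(r'[A-Z9]{5}\d{6}[A-Z9]{2}\d[A-Z]{2}', re.IGNORECASE),
    }

    def __init__(self, locale: str = 'en_GB'):
        # Pick the regex for this locale's region, if one exists
        _, _, region = locale.partition('_')
        self.regex: Optional[Pattern] = self.region_regex.get(region)

    def supported_locale(self, locale: str) -> bool:
        _, _, region = locale.partition('_')
        return region in self.region_regex

    def find(self, text: str) -> List[str]:
        if self.regex is None:
            return []
        return [m.group() for m in self.regex.finditer(text)]
```

A detector constructed with an unsupported locale simply finds nothing, which mirrors how region-localised detectors only activate for regions they have a pattern for.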
Detectors enabled by default¶
These are the detectors that are enabled in the scrubber by default.
scrubadub.detectors.CredentialDetector¶
- class scrubadub.detectors.CredentialDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
scrubadub.detectors.base.RegexDetector
Remove username/password combinations from dirty dirty text.
- regex: Optional[Pattern[str]]¶
Compiled regular expression object.
- filth_cls¶
alias of
scrubadub.filth.credential.CredentialFilth
- name: str = 'credential'¶
- autoload: bool = True¶
scrubadub.detectors.CreditCardDetector¶
- class scrubadub.detectors.CreditCardDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
scrubadub.detectors.base.RegexDetector
Remove credit-card numbers from dirty dirty text.
Supports Visa, MasterCard, American Express, Diners Club and JCB.
- regex: Optional[Pattern[str]]¶
Compiled regular expression object.
- name: str = 'credit_card'¶
- filth_cls¶
alias of
scrubadub.filth.credit_card.CreditCardFilth
- autoload: bool = True¶
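To illustrate the regex-based approach, here is a minimal sketch covering only the Visa and MasterCard number shapes. This is an assumed, simplified pattern for demonstration, not the library's actual regex, which also handles American Express, Diners Club and JCB:

```python
import re
from typing import List

# Illustrative pattern: Visa (13 or 16 digits, starts with 4) and
# MasterCard (16 digits, starts 51-55) only.
CARD_REGEX = re.compile(r"""
    \b(?:
        4[0-9]{12}(?:[0-9]{3})?    # Visa
        |5[1-5][0-9]{14}           # MasterCard
    )\b
""", re.VERBOSE)


def find_card_numbers(text: str) -> List[str]:
    """Return all card-like numbers found in the text."""
    return [m.group() for m in CARD_REGEX.finditer(text)]
```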
scrubadub.detectors.DriversLicenceDetector¶
- class scrubadub.detectors.DriversLicenceDetector(**kwargs)[source]¶
Bases:
scrubadub.detectors.base.RegionLocalisedRegexDetector
Use regular expressions to detect UK driving licence numbers. This is simple pattern matching, with no checksum validation.
- region_regex: Dict[str, Pattern]¶
- name: str = 'drivers_licence'¶
- autoload: bool = True¶
- filth_cls¶
alias of
scrubadub.filth.drivers_licence.DriversLicenceFilth
scrubadub.detectors.EmailDetector¶
- class scrubadub.detectors.EmailDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
scrubadub.detectors.base.RegexDetector
Use regular expression magic to remove email addresses from dirty dirty text. This method also catches email addresses like john at gmail.com.
- regex: Optional[Pattern[str]]¶
Compiled regular expression object.
- filth_cls¶
alias of
scrubadub.filth.email.EmailFilth
- name: str = 'email'¶
- autoload: bool = True¶
- at_matcher = re.compile('@|\\sat\\s', re.IGNORECASE)¶
- dot_matcher = re.compile('\\.|\\sdot\\s', re.IGNORECASE)¶
- iter_filth(text: str, document_name: Optional[str] = None) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
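The at_matcher and dot_matcher patterns above suggest how obfuscated addresses such as john at gmail dot com can be caught. A rough sketch of that precondition check follows; it is an illustration of the idea, not the detector's real logic:

```python
import re

# Patterns mirroring the documented at_matcher / dot_matcher idea:
# an '@' or the word 'at', a '.' or the word 'dot'.
AT = re.compile(r'@|\sat\s', re.IGNORECASE)
DOT = re.compile(r'\.|\sdot\s', re.IGNORECASE)


def looks_like_obfuscated_email(text: str) -> bool:
    """True if the text contains both an '@'-like and a '.'-like token,
    a rough precondition before running a full email regex."""
    return bool(AT.search(text)) and bool(DOT.search(text))
```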
scrubadub.detectors.en_GB.NationalInsuranceNumberDetector¶
- class scrubadub.detectors.en_GB.NationalInsuranceNumberDetector(**kwargs)[source]¶
Bases:
scrubadub.detectors.base.RegionLocalisedRegexDetector
Use regular expressions to remove the GB National Insurance number (NINO). This is simple pattern matching, with no checksum validation.
- region_regex: Dict[str, Pattern]¶
- name: str = 'national_insurance_number'¶
- autoload: bool = True¶
- filth_cls¶
alias of
scrubadub.filth.en_GB.national_insurance_number.NationalInsuranceNumberFilth
scrubadub.detectors.PhoneDetector¶
- class scrubadub.detectors.PhoneDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
scrubadub.detectors.base.Detector
Remove phone numbers from dirty dirty text using python-phonenumbers, a port of a Google project to correctly format phone numbers in text.
Set the locale on the scrubber or detector to set the region used to search for valid phone numbers. If the locale is set to ‘en_CA’, Canadian numbers will be searched for, while setting the locale to ‘en_GB’ searches for British numbers.
- filth_cls¶
alias of
scrubadub.filth.phone.PhoneFilth
- name: str = 'phone'¶
- autoload: bool = True¶
- iter_filth(text, document_name: Optional[str] = None)[source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
- classmethod supported_locale(locale: str) bool [source]¶
Returns True if this Detector supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
True if the locale is supported, otherwise False
- Return type
bool
scrubadub.detectors.PostalCodeDetector¶
- class scrubadub.detectors.PostalCodeDetector(**kwargs)[source]¶
Bases:
scrubadub.detectors.base.RegionLocalisedRegexDetector
Detects postal codes; currently only British post codes are supported.
- region_regex: Dict[str, Pattern]¶
- filth_cls¶
alias of
scrubadub.filth.postalcode.PostalCodeFilth
- name: str = 'postalcode'¶
- autoload: bool = True¶
scrubadub.detectors.en_GB.TaxReferenceNumberDetector¶
- class scrubadub.detectors.en_GB.TaxReferenceNumberDetector(**kwargs)[source]¶
Bases:
scrubadub.detectors.base.RegionLocalisedRegexDetector
Use regular expressions to detect the UK PAYE temporary reference number (TRN). This is simple pattern matching, with no checksum validation.
- region_regex: Dict[str, Pattern]¶
- name: str = 'tax_reference_number'¶
- autoload: bool = True¶
- filth_cls¶
alias of
scrubadub.filth.en_GB.tax_reference_number.TaxReferenceNumberFilth
scrubadub.detectors.TwitterDetector¶
- class scrubadub.detectors.TwitterDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
scrubadub.detectors.base.RegexDetector
Use regular expression magic to remove Twitter usernames from dirty dirty text.
- regex: Optional[Pattern[str]]¶
Compiled regular expression object.
- filth_cls¶
alias of
scrubadub.filth.twitter.TwitterFilth
- name: str = 'twitter'¶
- autoload: bool = True¶
scrubadub.detectors.UrlDetector¶
- class scrubadub.detectors.UrlDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
scrubadub.detectors.base.RegexDetector
Use regular expressions to remove URLs that begin with http://, https:// or www. from dirty dirty text.
With keep_domain=True, this detector only obfuscates the path of a URL, not its domain. For example, http://twitter.com/someone/status/234978haoin becomes http://twitter.com/{{replacement}}.
- regex: Optional[Pattern[str]]¶
Compiled regular expression object.
- filth_cls¶
alias of
scrubadub.filth.url.UrlFilth
- name: str = 'url'¶
- autoload: bool = True¶
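The keep_domain behaviour described above can be sketched with the standard library. This illustrates the idea of keeping the scheme and domain while replacing the path; it is not the detector's actual implementation:

```python
from urllib.parse import urlsplit, urlunsplit


def obfuscate_url_path(url: str, replacement: str = '{{replacement}}') -> str:
    """Keep the scheme and domain of a URL, replacing its path."""
    parts = urlsplit(url)
    if not parts.path or parts.path == '/':
        # Nothing to obfuscate: the URL is just a domain
        return url
    # Drop query and fragment along with the path
    return urlunsplit((parts.scheme, parts.netloc, '/' + replacement, '', ''))
```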
scrubadub.detectors.VehicleLicencePlateDetector¶
- class scrubadub.detectors.VehicleLicencePlateDetector(**kwargs)[source]¶
Bases:
scrubadub.detectors.base.RegionLocalisedRegexDetector
Detects standard British licence plates.
- region_regex: Dict[str, Pattern]¶
- filth_cls¶
alias of
scrubadub.filth.vehicle_licence_plate.VehicleLicencePlateFilth
- name: str = 'vehicle_licence_plate'¶
- autoload: bool = True¶
Optional detectors¶
These detectors need to be manually added to a Scrubber; they are not loaded automatically.
An example is shown below that demonstrates the various ways that a detector can be added to a Scrubber:
>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(detector_list=[scrubadub.detectors.TextBlobNameDetector()])
>>> scrubber.add_detector(scrubadub.detectors.CreditCardDetector)
>>> scrubber.add_detector('skype')
>>> detector = scrubadub.detectors.DateOfBirthDetector(require_context=True)
>>> scrubber.add_detector(detector)
For further information see the usage page.
scrubadub.detectors.DateOfBirthDetector¶
- class scrubadub.detectors.DateOfBirthDetector(context_before: int = 2, context_after: int = 1, require_context: bool = True, context_words: Optional[List[str]] = None, **kwargs)[source]¶
Bases:
scrubadub.detectors.base.Detector
This detector aims to detect dates of birth in text.
First all possible dates are found, then they are filtered to those that would result in people being between DateOfBirthFilth.min_age_years and DateOfBirthFilth.max_age_years, which default to 18 and 100 respectively.
If require_context is True, we search for one of the possible context_words near the found date. We search up to context_before lines before the date and up to context_after lines after the date. The context words we search for are terms like ‘birth’ or ‘DoB’, which increase the likelihood that the date is indeed a date of birth. The context words can be set using the context_words parameter, which expects a list of strings.

>>> import scrubadub, scrubadub.detectors.date_of_birth
>>> DateOfBirthFilth.min_age_years = 12
>>> scrubber = scrubadub.Scrubber(detector_list=[
...     scrubadub.detectors.date_of_birth.DateOfBirthDetector(),
... ])
>>> scrubber.clean("I was born on 10-Nov-2008.")
'I was born {{DATE_OF_BIRTH}}.'
- name: str = 'date_of_birth'¶
- filth_cls¶
alias of
scrubadub.filth.date_of_birth.DateOfBirthFilth
- autoload: bool = False¶
- context_words_language_map = {'de': ['geburt', 'geboren', 'geb', 'geb.'], 'en': ['birth', 'born', 'dob', 'd.o.b.']}¶
- __init__(context_before: int = 2, context_after: int = 1, require_context: bool = True, context_words: Optional[List[str]] = None, **kwargs)[source]¶
Initialise the detector.
- Parameters
context_before (int) – The number of lines of context to search before the date
context_after (int) – The number of lines of context to search after the date
require_context (bool) – Set to False if your dates of birth are not near words that provide context (such as “birth” or “DOB”).
context_words (bool) – A list of words that provide context related to dates of birth, such as the following: ‘birth’, ‘born’, ‘dob’ or ‘d.o.b.’.
name (str, optional) – Overrides the default name of the Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- iter_filth(text: str, document_name: Optional[str] = None) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Search text for Filth and return a generator of Filth objects.
- Parameters
text (str) – The dirty text that this Detector should search
document_name (Optional[str]) – Name of the document this is being passed to this detector
- Returns
The found Filth in the text
- Return type
Generator[Filth]
- classmethod supported_locale(locale: str) bool [source]¶
Returns True if this Detector supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code eg “en”, “es”.
- Returns
True if the locale is supported, otherwise False
- Return type
bool
scrubadub.detectors.SkypeDetector¶
- class scrubadub.detectors.SkypeDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
scrubadub.detectors.base.RegexDetector
Skype usernames tend to be used inline in dirty dirty text quite often, but also appear as skype: {{SKYPE}} quite a bit. This method looks at words within word_radius words of “skype” for things that appear to be misspelled or have punctuation in them, as a means to identify Skype usernames.
The default word_radius is 10, corresponding to the rough scale of half of a sentence before or after the word “skype” is used. Increasing word_radius will increase the false positive rate, and decreasing word_radius will increase the false negative rate.
- filth_cls¶
alias of
scrubadub.filth.skype.SkypeFilth
- name: str = 'skype'¶
- autoload: bool = False¶
- word_radius = 10¶
- SKYPE_TOKEN = '[a-zA-Z][a-zA-Z0-9_\\-\\,\\.]+'¶
- SKYPE_USERNAME = re.compile('[a-zA-Z][a-zA-Z0-9_\\-\\,\\.]{5,31}')¶
- iter_filth(text, document_name: Optional[str] = None) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
scrubadub.detectors.TaggedEvaluationFilthDetector¶
- class scrubadub.detectors.TaggedEvaluationFilthDetector(known_filth_items: List[scrubadub.detectors.tagged.KnownFilthItem], **kwargs)[source]¶
Bases:
scrubadub.detectors.base.Detector
Use this Detector to find tagged filth and treat it as true Filth. This is useful when you want to evaluate the effectiveness of a Detector using Filth that has been selected by a human.
Results from this detector are used as the “truth” against which the other detectors are compared. This is done in scrubadub.comparison.get_filth_classification_report, where the detection accuracies are calculated.
An example of how to use this detector is given below:

>>> import scrubadub, scrubadub.comparison, scrubadub.detectors.text_blob
>>> scrubber = scrubadub.Scrubber(detector_list=[
...     scrubadub.detectors.text_blob.TextBlobNameDetector(name='name_detector'),
...     scrubadub.detectors.TaggedEvaluationFilthDetector([
...         {'match': 'Tom', 'filth_type': 'name'},
...         {'match': 'tom@example.com', 'filth_type': 'email'},
...     ]),
... ])
>>> filth_list = list(scrubber.iter_filth("Hello I am Tom"))
>>> print(scrubadub.comparison.get_filth_classification_report(filth_list))
filth    detector       locale    precision    recall  f1-score   support
name     name_detector  en_US          1.00      1.00      1.00         1
accuracy                                                   1.00         1
macro avg                              1.00      1.00      1.00         1
weighted avg                           1.00      1.00      1.00         1
This detector takes a list of dictionaries (referred to as known filth items). These specify what to look for in the text to label as tagged filth. The dictionary should contain the following keys:
match (str) – a string value that will be searched for in the text
filth_type (str) – a string value that indicates the type of Filth; it should be set to the Filth.name. Examples of these could be ‘name’ or ‘phone’ for name and phone filths respectively.
The known filth item dictionary may also optionally contain:
match_end (str) – if specified, will search for Filth starting with the value of match and ending with the value of match_end
limit (int) – an integer describing the maximum number of characters between match and match_end, defaults to 150
ignore_case (bool) – ignore case when searching for the tagged filth
ignore_whitespace (bool) – ignore whitespace when matching (“asd qwe” can also match “asd\nqwe”)
ignore_partial_word_matches (bool) – ignore matches that are only partial words (if you’re looking for “Eve”, this flag ensures it won’t match “Evening”)
Examples of this:
{'match': 'aaa', 'filth_type': 'name'} – will search for an exact match to aaa and return it as a NameFilth
{'match': 'aaa', 'match_end': 'zzz', 'filth_type': 'name'} – will search for aaa followed by up to 150 characters followed by zzz, which would match both aaabbbzzz and aaazzz
{'match': '012345', 'filth_type': 'phone', 'ignore_partial_word_matches': True} – will search for an exact match to 012345, ignoring any partial matches, and return it as a PhoneFilth
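A sketch of how the optional flags above can translate into a regular expression. This illustrates the documented matching semantics (ignore_case, ignore_whitespace, match_end with a limit), not the library's implementation:

```python
import re
from typing import Pattern


def build_pattern(item: dict) -> Pattern:
    """Build a regex from a known filth item dict."""
    flags = re.IGNORECASE if item.get('ignore_case') else 0

    def prep(s: str) -> str:
        if item.get('ignore_whitespace'):
            # Any run of whitespace in the match string matches any whitespace
            return r'\s+'.join(re.escape(part) for part in s.split())
        return re.escape(s)

    pattern = prep(item['match'])
    if 'match_end' in item:
        # Allow up to `limit` characters (default 150) between match and match_end
        limit = item.get('limit', 150)
        pattern += '.{0,%d}?' % limit + prep(item['match_end'])
    return re.compile(pattern, flags)
```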
This detector is not enabled by default (since you need to supply a list of known filths), so you must always add it to your scrubber with a scrubber.add_detector(detector) call or by adding it to the detector_list when initialising a Scrubber.
- filth_cls¶
alias of
scrubadub.filth.tagged.TaggedEvaluationFilth
- name: str = 'tagged'¶
- autoload: bool = False¶
- __init__(known_filth_items: List[scrubadub.detectors.tagged.KnownFilthItem], **kwargs)[source]¶
Initialise the Detector.
- Parameters
known_filth_items (list of dicts) – A list of dictionaries that describe items to be searched for in the dirty text. The keys match and filth_type are required, which give the text to be searched for and the type of filth that the match string represents. See the class docstring for further details of available flags in this dictionary.
tagged_filth (bool, default True) – Whether the filth has been tagged and should be used as truth when calculating filth finding accuracies.
name (str, optional) – Overrides the default name of the Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- static dedup_dicts(known_filth_items: List[scrubadub.detectors.tagged.KnownFilthItem]) List[scrubadub.detectors.tagged.KnownFilthItem] [source]¶
- create_filth(start_location: int, end_location: int, text: str, comparison_type: Optional[str], detector_name: str, document_name: Optional[str], locale: str) scrubadub.filth.base.Filth [source]¶
- iter_filth(text: str, document_name: Optional[str] = None) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
scrubadub.detectors.TextBlobNameDetector¶
- class scrubadub.detectors.TextBlobNameDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases:
scrubadub.detectors.base.RegexDetector
Use part of speech tagging from textblob to clean proper nouns out of the dirty dirty text. Disallow particular nouns by adding them to the NameDetector.disallowed_nouns set.
- filth_cls¶
alias of
scrubadub.filth.name.NameFilth
- name: str = 'text_blob_name'¶
- autoload: bool = False¶
- disallowed_nouns = {'skype'}¶
- iter_filth(text, document_name: Optional[str] = None) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
- classmethod supported_locale(locale: str) bool [source]¶
Returns True if this Detector supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
True if the locale is supported, otherwise False
- Return type
bool
scrubadub.detectors.UserSuppliedFilthDetector¶
- class scrubadub.detectors.UserSuppliedFilthDetector(known_filth_items: List[scrubadub.detectors.tagged.KnownFilthItem], **kwargs)[source]¶
Bases:
scrubadub.detectors.tagged.TaggedEvaluationFilthDetector
Use this Detector to find some known filth in the text. An example might be if you have a list of employee numbers that you wish to remove from a document, as shown below:

>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(detector_list=[
...     scrubadub.detectors.UserSuppliedFilthDetector([
...         {'match': 'Anika', 'filth_type': 'name'},
...         {'match': 'Larry', 'filth_type': 'name'},
...     ]),
... ])
>>> scrubber.clean("Anika is my favourite employee.")
'{{NAME}} is my favourite employee.'
This detector takes a list of dictionaries (referred to as known filth items). These specify what to look for in the text to label as tagged filth. The dictionary should contain the following keys:
match (str) – a string value that will be searched for in the text
filth_type (str) – a string value that indicates the type of Filth; it should be set to the Filth.name. Examples of these could be ‘name’ or ‘phone’ for name and phone filths respectively.
The known filth item dictionary may also optionally contain:
match_end (str) – if specified, will search for Filth starting with the value of match and ending with the value of match_end
limit (int) – an integer describing the maximum number of characters between match and match_end, defaults to 150
ignore_case (bool) – ignore case when searching for the tagged filth
ignore_whitespace (bool) – ignore whitespace when matching (“asd qwe” can also match “asd\nqwe”)
ignore_partial_word_matches (bool) – ignore matches that are only partial words (if you’re looking for “Eve”, this flag ensures it won’t match “Evening”)
Examples of this:
{'match': 'aaa', 'filth_type': 'name'} – will search for an exact match to aaa and return it as a NameFilth
{'match': 'aaa', 'match_end': 'zzz', 'filth_type': 'name'} – will search for aaa followed by up to 150 characters followed by zzz, which would match both aaabbbzzz and aaazzz
{'match': '012345', 'filth_type': 'phone', 'ignore_partial_word_matches': True} – will search for an exact match to 012345, ignoring any partial matches, and return it as a PhoneFilth
This detector is not enabled by default (since you need to supply a list of known filths), so you must always add it to your scrubber with a scrubber.add_detector(detector) call or by adding it to the detector_list when initialising a Scrubber.
- name: str = 'user_supplied'¶
- create_filth(start_location: int, end_location: int, text: str, comparison_type: Optional[str], detector_name: str, document_name: Optional[str], locale: str) scrubadub.filth.base.Filth [source]¶
External detectors¶
These are detectors that are not included in the scrubadub package, usually because they come with large external dependencies that are not always needed.
To use them, first import their package and then add them to the Scrubber; an example of this is shown below:
>>> import scrubadub, scrubadub_address
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_detector(scrubadub_address.detectors.AddressDetector)
scrubadub_address.detectors.AddressDetector¶
scrubadub_spacy.detectors.SpacyEntityDetector¶
- class scrubadub_spacy.detectors.SpacyEntityDetector(named_entities: Optional[Iterable[str]] = None, model: Optional[str] = None, **kwargs)[source]¶
Bases:
scrubadub.detectors.base.Detector
Use spaCy’s named entity recognition to identify possible Filth.
This detector is made to work with v3 of spaCy, since the NER model has been significantly improved in this version.
This is particularly useful to remove names from text, but can also be used to remove any entity that is recognised by spaCy. A full list of entities that spacy supports can be found here: https://spacy.io/api/annotation#named-entities.
Additional entities can be added like so:
>>> import scrubadub, scrubadub_spacy
>>> class MoneyFilth(scrubadub.filth.Filth):
...     type = 'money'
>>> scrubadub_spacy.detectors.spacy.SpacyEntityDetector.filth_cls_map['MONEY'] = MoneyFilth
>>> detector = scrubadub_spacy.detectors.spacy.SpacyEntityDetector(named_entities=['MONEY'])
>>> scrubber = scrubadub.Scrubber(detector_list=[detector])
>>> scrubber.clean("You owe me 12 dollars man!")
'You owe me {{MONEY}} man!'
The dictionary scrubadub_spacy.detectors.spacy.SpacyEntityDetector.filth_cls_map is used to map between the spaCy named entity label and the type of scrubadub Filth, while the named_entities argument sets which named entities are considered Filth by the SpacyEntityDetector.
.- filth_cls_map = {'DATE': <class 'scrubadub.filth.date_of_birth.DateOfBirthFilth'>, 'FAC': <class 'scrubadub.filth.location.LocationFilth'>, 'GPE': <class 'scrubadub.filth.location.LocationFilth'>, 'LOC': <class 'scrubadub.filth.location.LocationFilth'>, 'ORG': <class 'scrubadub.filth.organization.OrganizationFilth'>, 'PER': <class 'scrubadub.filth.name.NameFilth'>, 'PERSON': <class 'scrubadub.filth.name.NameFilth'>}¶
- name: str = 'spacy'¶
- language_to_model = {'de': 'de_dep_news_trf', 'en': 'en_core_web_trf', 'es': 'es_dep_news_trf', 'fr': 'fr_dep_news_trf', 'nl': 'nl_core_news_trf', 'zh': 'zh_core_web_trf'}¶
- disallowed_nouns = {'skype'}¶
- __init__(named_entities: Optional[Iterable[str]] = None, model: Optional[str] = None, **kwargs)[source]¶
Initialise the Detector.
- Parameters
named_entities (Iterable[str], optional) – Limit the named entities to those in this list, defaults to {'PERSON', 'PER', 'ORG'}
model (str, optional) – The name of the spacy model to use; it must contain a ‘ner’ step in the model pipeline (most do, but not all).
name (str, optional) – Overrides the default name of the Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- iter_filth_documents(document_list: Sequence[str], document_names: Sequence[Optional[str]]) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in a list of documents.
- Parameters
document_list (List[str]) – A list of documents to clean.
document_names (List[str]) – A list containing the name of each document.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
- iter_filth(text: str, document_name: Optional[str] = None) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
- classmethod supported_locale(locale: str) bool [source]¶
Returns True if this Detector supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
True if the locale is supported, otherwise False
- Return type
bool
scrubadub_spacy.detectors.SpacyNameDetector¶
- class scrubadub_spacy.detectors.SpacyNameDetector(include_spacy: bool = True, **kwargs)[source]¶
Bases:
scrubadub_spacy.detectors.spacy.SpacyEntityDetector
Add an extension to the spaCy detector to look for tokens that often occur before or after people’s names: a prefix might be Hello, as in “Hello Jane”, or Mrs, as in “Mrs Jane Smith”, and a suffix could be PhD, as in “Jane Smith PhD”.
See the SpacyDetector for further info on how to use this detector, as it shares many similar options.
Currently only English prefixes and suffixes are supported, but titles in other languages can easily be added, as in the example below:

>>> import scrubadub, scrubadub_spacy
>>> scrubadub_spacy.detectors.spacy_name_title.SpacyNameDetector.NOUN_TAGS['de'] = ['NN', 'NE', 'NNE']
>>> scrubadub_spacy.detectors.spacy_name_title.SpacyNameDetector.NAME_PREFIXES['de'] = ['frau', 'herr']
>>> detector = scrubadub_spacy.detectors.spacy_name_title.SpacyNameDetector(locale='de_DE',
...     model='de_core_news_sm', include_spacy=False)
>>> scrubber = scrubadub.Scrubber(detector_list=[detector], locale='de_DE')
>>> scrubber.clean("bleib dort Frau Schmidt")
'bleib dort {{NAME+NAME}}'
- name: str = 'spacy_name'¶
- NAME_PREFIXES = {'en': ['mr', 'mr.', 'mister', 'mrs', 'mrs.', 'misses', 'ms', 'ms.', 'miss', 'dr', 'dr.', 'doctor', 'prof', 'prof.', 'professor', 'lord', 'lady', 'rev', 'rev.', 'reverend', 'hon', 'hon.', 'honourable', 'hhj', 'honorable', 'judge', 'sir', 'madam', 'hello', 'dear', 'hi', 'hey', 'regards', 'to:', 'from:', 'sender:']}¶
- NAME_SUFFIXES = {'en': ['phd', 'bsc', 'msci', 'ba', 'md', 'qc', 'ma', 'mba']}¶
- NOUN_TAGS = {'en': ['NNP', 'NN', 'NNPS']}¶
- TOKEN_SEARCH_DISTANCE = 3¶
- MINIMUM_NAME_LENGTH = 1¶
- __init__(include_spacy: bool = True, **kwargs)[source]¶
Initialise the
Detector
.
- Parameters
include_spacy (bool, default True) – Include the default spacy named-entity detection in addition to the title-based detection.
named_entities (Iterable[str], optional) – Limit the named entities to those in this list, defaults to
{'PERSON', 'PER', 'ORG'}.
model (str, optional) – The name of the spacy model to use; it must contain a ‘ner’ step in its pipeline (most do, but not all).
name (str, optional) – Overrides the default name of the
Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- static find_names(doc: spacy.tokens.doc.Doc, tokens: Sequence[spacy.tokens.token.Token], noun_tags: List[str]) spacy.tokens.doc.Doc [source]¶
This function searches for possible names in a flagged set of tokens and adds them to the identified entities.
- iter_filth_documents(document_list: Sequence[str], document_names: Sequence[Optional[str]]) Generator[scrubadub.filth.base.Filth, None, None] [source]¶
Yields discovered filth in a list of documents.
- Parameters
document_list (List[str]) – A list of documents to clean.
document_names (List[str]) – A list containing the name of each document.
- Returns
A generator of the discovered
Filth
- Return type
Generator[Filth, None, None]
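The relationship between iter_filth_documents and per-document filth can be sketched in a simplified, standalone form. The SimpleFilth class and SketchDetector below are illustrative only (not the spacy implementation); the Filth fields mirror those used in the base-class example at the top of this page:

```python
from dataclasses import dataclass
from typing import Generator, Optional, Sequence


@dataclass
class SimpleFilth:
    # Mirrors the Filth fields used in the base-class example above.
    beg: int
    end: int
    text: str
    document_name: Optional[str]
    detector_name: str


class SketchDetector:
    name = 'sketch'
    target = 'Jane'  # illustrative: flag every literal occurrence of this token

    def iter_filth_documents(
        self,
        document_list: Sequence[str],
        document_names: Sequence[Optional[str]],
    ) -> Generator[SimpleFilth, None, None]:
        # Pair each document with its name and yield the filth found in each.
        for text, doc_name in zip(document_list, document_names):
            start = text.find(self.target)
            while start != -1:
                yield SimpleFilth(
                    beg=start,
                    end=start + len(self.target),
                    text=self.target,
                    document_name=doc_name,
                    detector_name=self.name,
                )
                start = text.find(self.target, start + 1)


detector = SketchDetector()
filths = list(detector.iter_filth_documents(
    ['Jane is here.', 'No names.'], ['a.txt', 'b.txt'],
))
print([(f.document_name, f.beg, f.end) for f in filths])  # [('a.txt', 0, 4)]
```

Processing all documents in one call lets a detector batch its work (e.g. run a spacy pipeline once over the whole list) rather than invoking the model per document.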
- classmethod supported_locale(locale: str) bool [source]¶
Returns true if this
Detector
supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
True
if the locale is supported, otherwise False
- Return type
bool
scrubadub_stanford.detectors.StanfordEntityDetector¶
- class scrubadub_stanford.detectors.StanfordEntityDetector(enable_person: bool = True, enable_organization: bool = True, enable_location: bool = False, **kwargs)[source]¶
Bases:
scrubadub.detectors.base.Detector
Search for people’s names, organizations’ names, and locations within text using the Stanford 3-class model.
The three classes of this model can be enabled with the three initialiser arguments enable_person, enable_organization and enable_location. An example of their usage is given below.
>>> import scrubadub, scrubadub_stanford
>>> detector = scrubadub_stanford.detectors.StanfordEntityDetector(
...     enable_person=False, enable_organization=False, enable_location=True
... )
>>> scrubber = scrubadub.Scrubber(detector_list=[detector])
>>> scrubber.clean('Jane is visiting London.')
'Jane is visiting {{LOCATION}}.'
- filth_cls¶
alias of
scrubadub.filth.base.Filth
- name: str = 'stanford'¶
- ignored_words = ['tennant']¶
- stanford_version = '4.0.0'¶
- stanford_download_url = 'https://nlp.stanford.edu/software/stanford-ner-{version}.zip'¶
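The two class attributes above presumably combine to give the model's download location, with stanford_version filling the {version} placeholder in stanford_download_url:

```python
# Reconstructing the download URL from the class attributes listed above.
stanford_version = '4.0.0'
stanford_download_url = 'https://nlp.stanford.edu/software/stanford-ner-{version}.zip'

url = stanford_download_url.format(version=stanford_version)
print(url)  # https://nlp.stanford.edu/software/stanford-ner-4.0.0.zip
```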
- __init__(enable_person: bool = True, enable_organization: bool = True, enable_location: bool = False, **kwargs)[source]¶
Initialise the
Detector
.
- Parameters
enable_person (bool, default True) – Enable detection of people’s names.
enable_organization (bool, default True) – Enable detection of organizations’ names.
enable_location (bool, default False) – Enable detection of locations.
name (str, optional) – Overrides the default name of the
Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- iter_filth(text, document_name: Optional[str] = None)[source]¶
Yields discovered filth in the provided
text
.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered
Filth
- Return type
Iterator[
Filth
]
- classmethod supported_locale(locale: str) bool [source]¶
Returns true if this
Detector
supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
True
if the locale is supported, otherwise False
- Return type
bool
Catalogue functions¶
These functions register or remove Detectors from the Detector catalogue.
scrubadub.detectors.register_detector¶
- scrubadub.detectors.register_detector(detector: Type[Detector], *, autoload: Optional[bool] = None) Type[Detector] [source]¶
Register a detector for use with the
Scrubber
class.You can use
register_detector(NewDetector, autoload=True)
after your detector definition to automatically register it with the
Scrubber
class so that it can be used to remove Filth.
The argument autoload decides whether a new Scrubber() instance should load this detector by default.
>>> import scrubadub
>>> class NewDetector(scrubadub.detectors.Detector):
...     pass
>>> scrubadub.detectors.register_detector(NewDetector, autoload=False)
<class 'scrubadub.detectors.catalogue.NewDetector'>
- Parameters
detector (Detector class) – The
Detector
to register with the scrubadub detector configuration.
autoload (Optional[bool]) – Whether to automatically load this
Detector
on Scrubber
initialisation.
scrubadub.detectors.remove_detector¶
- scrubadub.detectors.remove_detector(detector: Union[Type[Detector], str])[source]¶
Remove an already registered detector.
>>> import scrubadub
>>> class NewDetector(scrubadub.detectors.Detector):
...     pass
>>> scrubadub.detectors.catalogue.register_detector(NewDetector, autoload=False)
<class 'scrubadub.detectors.catalogue.NewDetector'>
>>> scrubadub.detectors.catalogue.remove_detector(NewDetector)
- Parameters
detector (Union[Type[Detector], str]) – The
Detector
(or its registered name) to remove from the scrubadub detector configuration.
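The catalogue pattern behind register_detector and remove_detector can be sketched in a standalone form. The dictionary and class names below are illustrative, not scrubadub internals; the sketch mirrors the documented signatures, including remove_detector accepting either a class or its name:

```python
from typing import Dict, Optional, Type, Union


class Detector:
    name: str = 'detector'
    autoload: bool = False


# Illustrative module-level catalogue mapping detector names to classes.
detector_catalogue: Dict[str, Type[Detector]] = {}


def register_detector(detector: Type[Detector], *,
                      autoload: Optional[bool] = None) -> Type[Detector]:
    if autoload is not None:
        # Decides whether a new Scrubber() instance loads this detector by default.
        detector.autoload = autoload
    detector_catalogue[detector.name] = detector
    return detector


def remove_detector(detector: Union[Type[Detector], str]) -> None:
    # Accept either the class itself or its registered name.
    name = detector if isinstance(detector, str) else detector.name
    detector_catalogue.pop(name, None)


class NewDetector(Detector):
    name = 'new_detector'


register_detector(NewDetector, autoload=False)
print('new_detector' in detector_catalogue)  # True
remove_detector('new_detector')
print('new_detector' in detector_catalogue)  # False
```

Returning the class from register_detector lets it double as a decorator placed directly above a detector definition.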