scrubadub.post_processors
PostProcessors can be used to process the detected Filth objects and make changes to them.
These are a new addition to scrubadub, and at the moment only simple ones exist, all of which alter the replacement string.
- class scrubadub.post_processors.base.PostProcessor(name: Optional[str] = None)
Bases: object
- autoload: bool = False
- index: int = 10000
- name: str = 'post_processor'
- process_filth(filth_list: Sequence[scrubadub.filth.base.Filth]) → Sequence[scrubadub.filth.base.Filth]
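A custom post-processor would typically subclass PostProcessor and override process_filth. The sketch below illustrates that pattern without importing scrubadub: FilthStub and PostProcessorStub are hypothetical stand-ins that mirror only the attributes documented above, plus a replacement_string attribute, which is assumed from the statement that post-processors alter the replacement string. LowercaseReplacer is likewise a made-up example, not part of the library.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class FilthStub:
    """Hypothetical stand-in for scrubadub.filth.base.Filth."""
    text: str
    type: str
    replacement_string: Optional[str] = None  # assumed attribute name


class PostProcessorStub:
    """Hypothetical stand-in mirroring the documented PostProcessor attributes."""
    name: str = 'post_processor'
    autoload: bool = False
    index: int = 10000

    def __init__(self, name: Optional[str] = None):
        # An explicit name overrides the class-level default.
        if name is not None:
            self.name = name

    def process_filth(self, filth_list: Sequence[FilthStub]) -> Sequence[FilthStub]:
        raise NotImplementedError('must be implemented in derived classes')


class LowercaseReplacer(PostProcessorStub):
    """Hypothetical post-processor that lowercases each replacement string."""
    name = 'lowercase_replacer'

    def process_filth(self, filth_list: Sequence[FilthStub]) -> Sequence[FilthStub]:
        for filth in filth_list:
            # Fall back to the filth type if no replacement has been set yet.
            if filth.replacement_string is None:
                filth.replacement_string = filth.type
            filth.replacement_string = filth.replacement_string.lower()
        return filth_list


processed = LowercaseReplacer().process_filth(
    [FilthStub(text='hernandezjenna@example.com', type='email', replacement_string='EMAIL')]
)
```

With the real library, the subclass would derive from scrubadub.post_processors.base.PostProcessor and be passed to Scrubber through post_processor_list, as in the examples further down this page.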
- class scrubadub.post_processors.filth_replacer.FilthReplacer(include_type: bool = True, include_count: bool = False, include_hash: bool = False, uppercase: bool = True, separator: Optional[str] = None, hash_length: Optional[int] = None, hash_salt: Optional[Union[str, bytes]] = None, **kwargs)
Bases: scrubadub.post_processors.base.PostProcessor
Creates tokens that are used to replace the Filth found in the text of a document.
This can be configured to include the filth type (e.g. phone, name, email), a unique number for each piece of Filth, and a hash of the Filth.
>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthReplacer(),
... ])
>>> scrubber.clean("Contact me at 522-368-8530 or hernandezjenna@example.com")
'Contact me at PHONE or EMAIL'
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthReplacer(include_hash=True, hash_salt='example', hash_length=8),
... ])
>>> scrubber.clean("Contact me at 522-368-8530 or hernandezjenna@example.com")
'Contact me at PHONE-7358BF44 or EMAIL-AC0B8AC3'
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthReplacer(include_count=True),
... ])
>>> scrubber.clean("Contact me at taylordaniel@example.com or hernandezjenna@example.com, "
...                "but taylordaniel@example.com is probably better.")
'Contact me at EMAIL-0 or EMAIL-1, but EMAIL-0 is probably better.'
- name: str = 'filth_replacer'
- autoload: bool = False
- index: int = 0
- typed_lookup: Dict[str, scrubadub.utils.Lookup] = {}
- __init__(include_type: bool = True, include_count: bool = False, include_hash: bool = False, uppercase: bool = True, separator: Optional[str] = None, hash_length: Optional[int] = None, hash_salt: Optional[Union[str, bytes]] = None, **kwargs)
Initialise the FilthReplacer.
- Parameters
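include_type (bool, default True) – Include the filth type (e.g. phone, name, email) in the label
include_count (bool, default False) – Include a unique number for each piece of Filth in the label
include_hash (bool, default False) – Include a hash of the Filth in the label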
uppercase (bool, default True) – Make the label uppercase
separator (Optional[str], default None) – Used to separate labels if a merged filth is being replaced
hash_length (Optional[int], default None) – The length of the hexadecimal hash
hash_salt (Optional[Union[str, bytes]], default None) – The salt used in the hashing process
- filth_label(filth: scrubadub.filth.base.Filth) → str
This function takes a filth and creates a label that can be used to replace the original text.
- Parameters
filth (Filth) – The Filth for which a replacement label should be created
- Returns
The replacement label that should be used for this Filth.
- Return type
str
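To make the label options concrete, here is a minimal sketch of how a label could be assembled from the filth type and a per-type counter. It is not the library's implementation: build_label is a hypothetical helper, plain dictionaries stand in for the typed_lookup of scrubadub.utils.Lookup objects, and joining the parts with '-' is inferred from the example output above.

```python
from typing import Dict, List


def build_label(filth_type: str, text: str,
                typed_counts: Dict[str, Dict[str, int]],
                include_type: bool = True,
                include_count: bool = False,
                uppercase: bool = True) -> str:
    """Hypothetical helper: compose a replacement label from the enabled parts."""
    parts: List[str] = []
    if include_type:
        parts.append(filth_type.upper() if uppercase else filth_type)
    if include_count:
        # One counter table per filth type; a text keeps the number it was
        # given the first time it was seen.
        counts = typed_counts.setdefault(filth_type, {})
        parts.append(str(counts.setdefault(text, len(counts))))
    return '-'.join(parts)


typed_counts: Dict[str, Dict[str, int]] = {}
emails = [
    'taylordaniel@example.com',
    'hernandezjenna@example.com',
    'taylordaniel@example.com',
]
labels = [build_label('email', e, typed_counts, include_count=True) for e in emails]
# labels == ['EMAIL-0', 'EMAIL-1', 'EMAIL-0']
```

Reusing the stored number for repeated text is what lets the same email address map to the same token everywhere in a document.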
- static get_hash(text: str, salt: bytes, length: int) → str
Get a hash of some text, that has been salted and truncated.
- Parameters
text (str) – The text to be hashed
salt (bytes) – The salt that should be used in this hashing
length (int) – The number of characters long that the hexadecimal hash should be
- Returns
The hash of the text
- Return type
str
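A salted, truncated hash of this kind can be sketched with the standard library. The hash algorithm is not stated on this page, so the choice of SHA-256, prepending the salt to the text, and uppercasing the digest are all assumptions; salted_hash is a hypothetical function and its output is not expected to match the tokens in the example above.

```python
import hashlib


def salted_hash(text: str, salt: bytes, length: int) -> str:
    """Hypothetical sketch: salt the text, hash it, keep `length` hex characters."""
    # Assumption: SHA-256 over salt + UTF-8 encoded text.
    digest = hashlib.sha256(salt + text.encode('utf-8')).hexdigest()
    # Truncate the hexadecimal digest and uppercase it to resemble the tokens above.
    return digest[:length].upper()


token = salted_hash('522-368-8530', salt=b'example', length=8)
same_token = salted_hash('522-368-8530', salt=b'example', length=8)
other_salt = salted_hash('522-368-8530', salt=b'another', length=16)
```

The same text and salt always yield the same token, so hashed labels remain comparable across documents, while changing the salt changes every token.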
- process_filth(filth_list: Sequence[scrubadub.filth.base.Filth]) → Sequence[scrubadub.filth.base.Filth]
Processes the filth to replace the original text
- class scrubadub.post_processors.prefix_suffix.PrefixSuffixReplacer(prefix: Optional[str] = '{{', suffix: Optional[str] = '}}', name: Optional[str] = None)
Bases: scrubadub.post_processors.base.PostProcessor
Add a prefix and/or suffix to the Filth’s replacement string.
>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthReplacer(),
... ])
>>> scrubber.clean("Contact me at 522-368-8530 or hernandezjenna@example.com")
'Contact me at PHONE or EMAIL'
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthReplacer(),
...     scrubadub.post_processors.PrefixSuffixReplacer(prefix='{{', suffix='}}'),
... ])
>>> scrubber.clean("Contact me at 522-368-8530 or hernandezjenna@example.com")
'Contact me at {{PHONE}} or {{EMAIL}}'
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthReplacer(),
...     scrubadub.post_processors.PrefixSuffixReplacer(prefix='<b>', suffix='</b>'),
... ])
>>> scrubber.clean("Contact me at 522-368-8530 or hernandezjenna@example.com")
'Contact me at <b>PHONE</b> or <b>EMAIL</b>'
- name: str = 'prefix_suffix_replacer'
- autoload: bool = False
- index: int = 1
- __init__(prefix: Optional[str] = '{{', suffix: Optional[str] = '}}', name: Optional[str] = None)
- process_filth(filth_list: Sequence[scrubadub.filth.base.Filth]) → Sequence[scrubadub.filth.base.Filth]
Processes the filth to add prefixes and suffixes to the replacement text
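The wrapping step itself is simple string concatenation, which the following sketch isolates. wrap_replacement is a hypothetical helper rather than a library function, and treating a None prefix or suffix as "add nothing on that side" is an assumption based on both arguments being Optional.

```python
from typing import Optional


def wrap_replacement(replacement: str,
                     prefix: Optional[str] = '{{',
                     suffix: Optional[str] = '}}') -> str:
    """Hypothetical helper: surround a replacement label with a prefix and suffix."""
    # Assumption: None means that nothing is added on that side.
    return (prefix or '') + replacement + (suffix or '')


labels = ['PHONE', 'EMAIL']
wrapped = [wrap_replacement(label) for label in labels]
# wrapped == ['{{PHONE}}', '{{EMAIL}}']
bold = [wrap_replacement(label, prefix='<b>', suffix='</b>') for label in labels]
# bold == ['<b>PHONE</b>', '<b>EMAIL</b>']
```

In the examples above, FilthReplacer is listed before PrefixSuffixReplacer, so the label exists before it is wrapped.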
- class scrubadub.post_processors.remover.FilthRemover(name: Optional[str] = None)
Bases: scrubadub.post_processors.base.PostProcessor
Removes all found filth from the original document.
>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthRemover(),
... ])
>>> scrubber.clean("Contact me at 522-368-8530 or hernandezjenna@example.com")
'Contact me at  or '
- name: str = 'filth_remover'
- autoload: bool = False
- index: int = 0
- process_filth(filth_list: Sequence[scrubadub.filth.base.Filth]) → Sequence[scrubadub.filth.base.Filth]
Processes the filth to remove the filth
Catalogue functions
scrubadub.post_processors.register_post_processor
- scrubadub.post_processors.register_post_processor(post_processor: Type[PostProcessor], autoload: Optional[bool] = None, index: Optional[int] = None) → None
Register a PostProcessor for use with the Scrubber class.
You can use register_post_processor(NewPostProcessor) after your post-processor definition to automatically register it with the Scrubber class so that it can be used to process Filth.
The argument autoload sets whether a new Scrubber() instance should load this PostProcessor by default.
- Parameters
post_processor (PostProcessor class) – The PostProcessor to register with the scrubadub post-processor configuration.
autoload (bool) – Whether to automatically load this PostProcessor on Scrubber initialisation.
index (int) – The location/index in which this PostProcessor should be added.
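The interplay of autoload and index can be illustrated with a toy registry. This is a sketch of the semantics described above, not scrubadub's internals: REGISTRY, toy_register, autoloaded and the Toy* classes are all hypothetical, and ordering the autoloaded classes by ascending index is an assumption.

```python
from typing import Dict, List, Optional, Type


class ToyPostProcessor:
    """Hypothetical stand-in carrying the documented class attributes."""
    name: str = 'post_processor'
    autoload: bool = False
    index: int = 10000


class ToyReplacer(ToyPostProcessor):
    name = 'toy_replacer'
    index = 0


class ToyWrapper(ToyPostProcessor):
    name = 'toy_wrapper'
    index = 1


# Hypothetical catalogue mapping each name to its registered class.
REGISTRY: Dict[str, Type[ToyPostProcessor]] = {}


def toy_register(post_processor: Type[ToyPostProcessor],
                 autoload: Optional[bool] = None,
                 index: Optional[int] = None) -> None:
    """Store the class, overriding autoload/index only when they are given."""
    if autoload is not None:
        post_processor.autoload = autoload
    if index is not None:
        post_processor.index = index
    REGISTRY[post_processor.name] = post_processor


def autoloaded() -> List[Type[ToyPostProcessor]]:
    """Classes a new scrubber would load by default, ordered by index (assumed)."""
    return sorted((c for c in REGISTRY.values() if c.autoload), key=lambda c: c.index)


toy_register(ToyWrapper, autoload=True)
toy_register(ToyReplacer, autoload=True)
toy_register(ToyPostProcessor)  # autoload stays False, so it is not loaded by default
order = [c.name for c in autoloaded()]
# order == ['toy_replacer', 'toy_wrapper']
```

Registration order does not matter in this sketch; only the index decides where each post-processor runs.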