scrubadub.comparison
Filth objects are responsible for marking particular sections of text as
containing that type of filth. They are also responsible for knowing how that text should
be cleaned. Every type of Filth inherits from scrubadub.filth.base.Filth.
- scrubadub.comparison.get_filth_classification_report(filth_list: List[scrubadub.filth.base.Filth], combine_detectors: bool = False, groupby_documents: bool = False, output_dict: bool = False) → Optional[Union[str, Dict[str, float]]]
Evaluates the performance of detectors using KnownFilth.
An example of using this is shown below:
>>> import scrubadub, scrubadub.comparison, scrubadub.detectors.text_blob
>>> scrubber = scrubadub.Scrubber(detector_list=[
...     scrubadub.detectors.text_blob.TextBlobNameDetector(name='name_detector'),
...     scrubadub.detectors.TaggedEvaluationFilthDetector([
...         {'match': 'Tom', 'filth_type': 'name'},
...         {'match': 'tom@example.com', 'filth_type': 'email'},
...     ]),
... ])
>>> filth_list = list(scrubber.iter_filth("Hello I am Tom"))
>>> print(scrubadub.comparison.get_filth_classification_report(filth_list))
filth    detector         locale      precision    recall  f1-score   support

name     name_detector    en_US            1.00      1.00      1.00         1

                            accuracy                           1.00         1
                           macro avg       1.00      1.00      1.00         1
                        weighted avg       1.00      1.00      1.00         1
- Parameters
filth_list (A list of Filth objects) – The list of detected filth
combine_detectors (bool, optional) – Combine performance of all detectors for the same filth/locale
groupby_documents (bool, optional) – Show performance for each file individually
output_dict (bool, optional) – Return the report as a JSON-style dict instead of plain text, defaults to False
- Returns
The report in JSON (a dict) or in plain text
- Return type
str or dict
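The precision, recall and f1-score columns in the report above follow the usual classification definitions. The sketch below shows those standard formulas in plain Python; the function name `precision_recall_f1` is hypothetical and this is not scrubadub's own implementation, only an illustration of how the reported numbers relate to true positive, false positive and false negative counts.

```python
from typing import Dict


def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int) -> Dict[str, float]:
    """Standard precision, recall and F1 from outcome counts (hypothetical helper)."""
    detected = true_pos + false_pos   # everything the detector flagged
    known = true_pos + false_neg      # everything that was tagged as known filth

    # Guard against division by zero when nothing was detected or nothing was known.
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / known if known else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0

    # "support" is the number of known items for this filth type.
    return {"precision": precision, "recall": recall, "f1-score": f1, "support": known}


# One known name that was found by the detector, with no spurious detections.
perfect = precision_recall_f1(true_pos=1, false_pos=0, false_neg=0)

# Three correct detections, one spurious detection, two missed items.
mixed = precision_recall_f1(true_pos=3, false_pos=1, false_neg=2)
```

With a single correctly detected item every metric is 1.00 and the support is 1, which matches the shape of the example report.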
- scrubadub.comparison.get_filth_dataframe(filth_list: List[scrubadub.filth.base.Filth]) → pandas.core.frame.DataFrame
Produces a pandas DataFrame to allow debugging and improving detectors.
An example of using this is shown below:
>>> import pandas as pd
>>> import scrubadub, scrubadub.comparison, scrubadub.detectors.text_blob
>>> scrubber = scrubadub.Scrubber(detector_list=[
...     scrubadub.detectors.text_blob.TextBlobNameDetector(name='name_detector'),
...     scrubadub.detectors.TaggedEvaluationFilthDetector([
...         {'match': 'Tom', 'filth_type': 'name'},
...         {'match': 'tom@example.com', 'filth_type': 'email'},
...     ]),
... ])
>>> filth_list = list(scrubber.iter_filth("Hello I am Tom"))
>>> with pd.option_context("display.max_columns", 20):
...     print(scrubadub.comparison.get_filth_dataframe(filth_list))
   group_id  filth_id filth_type  detector_name document_name text  beg  end  \
0         0         0       name  name_detector          None  Tom   11   14

  locale  known_filth  comparison_type known_text  known_beg  known_end  \
0  en_US         True              NaN        Tom         11         14

  known_comparison_type  exact_match  partial_match  true_positive  \
0                  name         True           True           True

   false_positive  false_negative
0           False           False
- Parameters
filth_list (A list of Filth objects) – The list of detected filth
- Returns
A pd.DataFrame containing information about the detected Filth
- Return type
pd.DataFrame
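When debugging a detector, it can help to count the outcomes per filth type and detector. The sketch below works on plain row dictionaries that use the column names shown in the example output (for instance, rows obtained with pandas' `DataFrame.to_dict("records")`). The helper `tally_outcomes` and the sample rows are hypothetical and invented for illustration; they are not part of scrubadub.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping, Tuple

# Boolean outcome columns, as they appear in the example DataFrame output.
OUTCOME_COLUMNS = ("true_positive", "false_positive", "false_negative")


def tally_outcomes(rows: Iterable[Mapping[str, Any]]) -> Dict[Tuple[str, str], Counter]:
    """Count outcomes per (filth_type, detector_name) pair (hypothetical helper)."""
    tallies: Dict[Tuple[str, str], Counter] = {}
    for row in rows:
        key = (row["filth_type"], row["detector_name"])
        counter = tallies.setdefault(key, Counter())
        for column in OUTCOME_COLUMNS:
            # Compare against True explicitly so that missing values such as
            # None or NaN are not counted as positives.
            if row.get(column) == True:  # noqa: E712
                counter[column] += 1
    return tallies


# Made-up rows for illustration only; real rows would come from the DataFrame.
example_rows = [
    {"filth_type": "name", "detector_name": "name_detector",
     "true_positive": True, "false_positive": False, "false_negative": False},
    {"filth_type": "name", "detector_name": "name_detector",
     "true_positive": False, "false_positive": True, "false_negative": False},
    {"filth_type": "email", "detector_name": "email",
     "true_positive": False, "false_positive": False, "false_negative": True},
]

tallies = tally_outcomes(example_rows)
```

Rows that count as false positives or false negatives are usually the ones worth inspecting first, since their `text` and `known_text` values show what the detector got wrong.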
- scrubadub.comparison.make_fake_document(paragraphs: int = 20, locale: str = 'en_US', seed: Optional[int] = None, faker: Optional[faker.proxy.Faker] = None, filth_types: Optional[List[str]] = None, fake_text_function: Optional[Callable[[...], str]] = None, additional_filth_types: Optional[Iterable[Type[scrubadub.filth.base.Filth]]] = None) → Tuple[str, List[scrubadub.detectors.tagged.KnownFilthItem]]
Creates a fake document containing Filth that needs to be removed. Also returns the list of known filth items that are needed by the TaggedEvaluationFilthDetector.
An example of using this is shown below:
>>> import scrubadub, scrubadub.comparison
>>> document, known_filth_items = scrubadub.comparison.make_fake_document(paragraphs=1, seed=1)
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_detector(scrubadub.detectors.TaggedEvaluationFilthDetector(
...     known_filth_items=known_filth_items
... ))
>>> filth_list = list(scrubber.iter_filth(document))
>>> print(scrubadub.comparison.get_filth_classification_report(filth_list))
filth    detector    locale      precision    recall  f1-score   support

email    email       en_US            1.00      1.00      1.00         2
url      url         en_US            1.00      1.00      1.00         1

                      micro avg       1.00      1.00      1.00         3
                      macro avg       1.00      1.00      1.00         3
                   weighted avg       1.00      1.00      1.00         3
                    samples avg       1.00      1.00      1.00         3
- Parameters
paragraphs (int) – The number of paragraphs to generate in the fake document
locale (str) – The locale of the documents in the format: two-letter lower-case language code followed by an underscore and the two-letter upper-case country code, e.g. “en_GB” or “de_CH”
seed (int, optional) – The random seed used to generate the document
faker (faker.Faker, optional) – A Faker object that is used to generate the text
filth_types (List[str], optional) – A list of the Filth.type values to generate
fake_text_function (Callable, optional) – A function that generates 1-3 sentences of text
- Returns
The document and a list of KnownFilthItems
- Return type
Tuple[str, List[KnownFilthItem]]
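The locale format described in the parameters above (two lower-case letters, an underscore, two upper-case letters) can be checked before calling make_fake_document. The sketch below is a minimal format check; the name `is_valid_locale` is hypothetical, and the check does not confirm that scrubadub or Faker actually supports a given locale.

```python
import re

# Two lower-case letters, an underscore, then two upper-case letters.
LOCALE_PATTERN = re.compile(r"[a-z]{2}_[A-Z]{2}")


def is_valid_locale(locale: str) -> bool:
    """Return True if the string has the 'xx_YY' shape (hypothetical helper)."""
    # fullmatch requires the whole string to match, not just a prefix.
    return LOCALE_PATTERN.fullmatch(locale) is not None


checks = {code: is_valid_locale(code) for code in ("en_GB", "de_CH", "en-gb", "EN_gb", "en")}
```

Here "en_GB" and "de_CH" pass, while "en-gb", "EN_gb" and "en" are rejected because of the wrong separator, wrong letter case, or missing country code.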