Change Log¶
This project uses semantic versioning to track version numbers, where backwards incompatible changes (highlighted in bold) bump the major version of the package.
latest changes in development for next release¶
2.0.0¶
There have been some changes in the scrubadub API, but few breaking changes. The headline changes include:
Several new detectors have been added (spacy, stanford NER, tax reference number, credit card, …).
Splitting of the scrubadub package into smaller parts.
Added ability to easily evaluate a Detector's performance, see Accuracy.
Started to localise detectors to function for more than one language/location.
Support for scrubbing multiple documents together.
Introduced the concept of a PostProcessor. This will allow more complex groupings of Filths and new types of tokenization.
New detector configuration/management system.
Scrubber¶
Detectors and PostProcessors can be added and removed using a string containing their default name, their class or an instance.
You can clean multiple documents with one Scrubber().clean_documents(docs) call.
A default set of Detectors is loaded instead of all Detectors. This is particularly useful for detectors that are slow or have complex dependencies, as they don't need to be loaded each time. However, this might need an explicit Scrubber().add_detector(detector) call for the same behaviour as before.
Added a locale parameter to the Scrubber initialiser.
A Scrubber will only auto-load detectors that support the given Scrubber locale.
The Scrubber will ensure that filth are valid with a call to Filth().is_valid().
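The three ways of adding a detector described above (default name, class or instance) and the multi-document call can be pictured with a small stand-alone sketch. It does not import scrubadub: ToyScrubber, ToyEmailDetector, CATALOGUE and the scrub() method are hypothetical stand-ins, and the real Scrubber may resolve detectors differently.

```python
import re
from typing import Dict, List, Optional


class ToyEmailDetector:
    """Hypothetical stand-in for a detector; not scrubadub's EmailDetector."""

    name = "email"  # default name, used when the detector is added by string
    pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    def __init__(self, name: Optional[str] = None):
        if name is not None:
            self.name = name

    def scrub(self, text: str) -> str:
        return self.pattern.sub("{{EMAIL}}", text)


# Assumed structure: a mapping from default names to detector classes.
CATALOGUE = {ToyEmailDetector.name: ToyEmailDetector}


class ToyScrubber:
    def __init__(self):
        self.detectors: Dict[str, ToyEmailDetector] = {}

    def add_detector(self, detector) -> None:
        # Accept a default name, a class, or an already configured instance.
        if isinstance(detector, str):
            detector = CATALOGUE[detector]()
        elif isinstance(detector, type):
            detector = detector()
        if detector.name in self.detectors:
            raise KeyError(f"detector name {detector.name!r} is already in use")
        self.detectors[detector.name] = detector

    def clean(self, text: str) -> str:
        for detector in self.detectors.values():
            text = detector.scrub(text)
        return text

    def clean_documents(self, docs: List[str]) -> List[str]:
        return [self.clean(doc) for doc in docs]
```

Adding the same class twice under different names works because each instance carries its own name, which is why names need to be unique within one scrubber.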
Detectors¶
The name of the detector has been separated from the type of filth found. This means multiple instances of the same detector (configured differently) can be in the same Scrubber instance and one Detector can return multiple types of Filth.
Detectors are now required to define an attribute called name, which should be unique within a Scrubber instance.
Detectors are now passed a locale argument to the Detector initialiser.
Detectors have an optional supported_locale(locale) function that returns a bool to indicate if a given Detector supports a locale.
Regular expressions used by the RegexDetector class have been moved from RegexFilth.regex to RegexDetector.regex.
Renamed SSNDetector to SocialSecurityNumberDetector.
New AddressDetector, which detects US, CA and GB addresses.
New CreditCardDetector, which detects credit card numbers (based on the Detector in the alphagov scrubadub fork).
New DateOfBirthDetector, which detects dates of birth (thanks to @mirandachong).
New DriversLicenceDetector, which detects GB drivers licence numbers.
New TaggedEvaluationFilthDetector, which is used to tag real filth in text when you're evaluating the quality of your filth removal.
New UserSuppliedFilthDetector, which is used to find bits of Filth that you know will be in the text.
New PostalCodeDetector, which detects GB post codes.
New SpacyEntityDetector, which detects a range of named entities, including names (thanks to @aCampello).
New StanfordEntityDetector, which also detects a slightly different range of named entities, including names.
New NationalInsuranceNumberDetector, which detects GB National Insurance Numbers (NINO) (thanks to @mirandachong).
New TaxReferenceNumberDetector, which detects GB Tax Reference Numbers (TRN) (thanks to @mirandachong).
New VehicleLicencePlateDetector, which detects number plates on GB cars (based on the Detector in the alphagov scrubadub fork).
New RegionLocalisedRegexDetector, which is derived from the convenience class RegexDetector to allow for quickly creating regional regex-based detectors.
Detectors can now be registered to a catalogue of Detectors. This allows detectors to be defined in separate packages.
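As a rough illustration of the name attribute, the locale argument and supported_locale(locale), the following stand-alone sketch filters detector classes by locale before instantiating them. ToyDetector, ToyGBPostCodeDetector, find() and autoload() are hypothetical names for this example only, the post code pattern is deliberately simplified, and none of this is scrubadub's own code.

```python
import re
from typing import List, Optional, Type


class ToyDetector:
    """Hypothetical base class; scrubadub's Detector has a richer interface."""

    name = "toy"  # should be unique within one scrubber

    def __init__(self, name: Optional[str] = None, locale: str = "en_US"):
        if name is not None:
            self.name = name
        self.locale = locale

    @classmethod
    def supported_locale(cls, locale: str) -> bool:
        # By default a detector claims to work for every locale.
        return True


class ToyGBPostCodeDetector(ToyDetector):
    name = "postal_code"
    # Simplified pattern for illustration; real GB post codes have more rules.
    regex = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}\b")

    @classmethod
    def supported_locale(cls, locale: str) -> bool:
        # Only the region part of a locale such as "en_GB" matters here.
        return locale.split("_")[-1] == "GB"

    def find(self, text: str) -> List[str]:
        return self.regex.findall(text)


def autoload(classes: List[Type[ToyDetector]], locale: str) -> List[ToyDetector]:
    """Instantiate only the detectors that support the requested locale."""
    return [cls(locale=locale) for cls in classes if cls.supported_locale(locale)]
```

With this arrangement a scrubber created for en_US would skip the GB-only detector, while one created for en_GB would load both.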
Filth¶
Introduced three parameters in the constructor: detector_name, document_name and locale. These keep track of the Detector that found the Filth, the document it came from and the document's locale. This results in Filth objects being passed additional parameters on initialisation. If you have defined custom Filths, they will need to be updated so that Filth.__init__ accepts the detector_name, document_name and locale keywords and calls the base class constructor.
Added a generate() function that generates fake examples of that Filth. This can be used to help evaluate detector performance.
Added an is_valid() function, which can be used to ensure that a piece of detected filth is indeed valid.
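The constructor change and is_valid() might look like the following in a custom filth class. This is a stand-alone sketch: ToyFilth only mirrors the keywords named above and is not scrubadub's Filth, the issuer argument is invented for the example, and the Luhn checksum is one plausible validity check for card numbers rather than a statement of what the library does.

```python
from typing import Optional


class ToyFilth:
    """Hypothetical base class mirroring only the keywords named above."""

    type = "unknown"

    def __init__(self, beg: int, end: int, text: str,
                 detector_name: Optional[str] = None,
                 document_name: Optional[str] = None,
                 locale: str = "en_US"):
        self.beg, self.end, self.text = beg, end, text
        self.detector_name = detector_name
        self.document_name = document_name
        self.locale = locale

    def is_valid(self) -> bool:
        return True


class ToyCreditCardFilth(ToyFilth):
    type = "credit_card"

    def __init__(self, *args, issuer: Optional[str] = None, **kwargs):
        # Forward detector_name, document_name and locale to the base class.
        super().__init__(*args, **kwargs)
        self.issuer = issuer  # invented extra field, for illustration only

    def is_valid(self) -> bool:
        # Luhn checksum: double every second digit from the right and
        # require the digit sum to be a multiple of ten.
        digits = [int(c) for c in self.text if c.isdigit()]
        if len(digits) < 12:
            return False
        total = 0
        for i, d in enumerate(reversed(digits)):
            if i % 2 == 1:
                d *= 2
                if d > 9:
                    d -= 9
            total += d
        return total % 10 == 0
```

Accepting `*args, **kwargs` and passing them on keeps the subclass working if the base class gains further keywords later.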
PostProcessors¶
Introduction of simple PostProcessors:
- FilthReplacer: Replace the filth with the type of filth (example@example.com -> EMAIL), a configurable hash (example@example.com -> 196aa39e9f8159ec) or a monotonically increasing number for each unique piece of filth, optionally including the filth type (example@example.com -> EMAIL-1).
- PrefixSuffixReplacer: Add a prefix and/or suffix onto the replacement (EMAIL-1 -> {{EMAIL-1}}).
It is envisioned that other more complex operations can be done here too such as grouping filth (e.g. “John”, “John Doe” and “Mr. Doe” could be grouped together).
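The replacement styles listed above can be sketched with the standard library alone. The function and class names here are hypothetical, and the choice of salted SHA-256 truncated to 16 hex characters is an assumption made for the example; the library's own hashing scheme may differ.

```python
import hashlib
from typing import Dict


def replace_with_type(filth_type: str) -> str:
    """Replace the filth with its upper-cased type, e.g. EMAIL."""
    return filth_type.upper()


def replace_with_hash(text: str, salt: str = "", length: int = 16) -> str:
    # Assumed scheme: salted SHA-256, truncated to `length` hex characters.
    return hashlib.sha256((salt + text).encode("utf-8")).hexdigest()[:length]


class NumberedReplacer:
    """Give each unique piece of filth the next integer, starting at 1."""

    def __init__(self):
        self._seen: Dict[str, int] = {}

    def replace(self, filth_type: str, text: str) -> str:
        # A repeated piece of filth gets the number it was first assigned.
        number = self._seen.setdefault(text, len(self._seen) + 1)
        return f"{filth_type.upper()}-{number}"


def add_prefix_suffix(replacement: str, prefix: str = "{{", suffix: str = "}}") -> str:
    """Wrap a replacement string, e.g. EMAIL-1 becomes {{EMAIL-1}}."""
    return f"{prefix}{replacement}{suffix}"
```

Keeping the numbering in a small stateful object means the same address maps to the same token across a document, which is what makes the output usable for linking mentions.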
1.2.2¶
LeapBeyond are now supporting scrubadub with maintenance and development.
bug fixes:
StopIteration no longer supported in recent python versions (#41 via @roman-y-korolev)
Fix test runner with python 3 (#42 via @roman-y-korolev)
Update documentation to reflect new repository location (#49)
This is the last version that will be explicitly compatible with python 2.7.
1.2.1¶
bug fixes:
bumped textblob version (#43 via @roman-y-korolev)
fixed documentation (#32 via @ivyleavedtoadflax)
1.2.0¶
added python 3 compatibility (#31 via @davidread)
1.1.1¶
1.1.0¶
1.0.3¶
minor change to force Detector.filth_cls to exist (#13)
1.0.1¶
several bug fixes, including:
installation bug (#12)
1.0.0¶
major update to process Filth in parallel (#11)
0.1.0¶
0.0.1¶
initial release, ported from past projects