Deduplication is the task of combining different representations of the same real-world entity.
This Python package implements deduplication using active learning. Active learning allows for
rapid training without having to provide a large, manually labelled dataset.
DedupliPy is an end-to-end solution with advantages over existing solutions:
- active learning: no large, manually labelled dataset is required
- during active learning, the user is notified when the model has converged and training may be stopped
- works out of the box, while advanced users can choose settings as desired (custom blocking
  rules, custom metrics, interaction features)
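
To make the active-learning idea concrete, here is a minimal sketch of uncertainty sampling, one common query strategy: the pair the current model is least sure about is shown to the user next. Whether DedupliPy uses exactly this strategy is an assumption; the function name `most_uncertain` and the probabilities below are hypothetical and not part of the package.

```python
def most_uncertain(probabilities):
    """Return the pair whose predicted match probability is closest to 0.5."""
    return min(probabilities, key=lambda pair: abs(probabilities[pair] - 0.5))


# Hypothetical model outputs for three unlabelled candidate pairs.
predicted = {("A", "B"): 0.97, ("A", "C"): 0.55, ("B", "C"): 0.08}

# The model is already confident about (A, B) and (B, C), so labelling
# (A, C) is likely to be the most informative next question.
query = most_uncertain(predicted)
print(query)  # ('A', 'C')
```

Each answer is added to the labelled set and the model is refitted, so only a handful of informative pairs need manual labels.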
DedupliPy is developed using modAL, Scikit-Learn and SciPy.
How does it work?
Naively comparing every pair of records in a dataset with n rows would mean checking n(n-1)/2 pairs. DedupliPy
first creates blocks of pairs in which duplicates are likely to be found. Next, string similarity metrics
are applied to these pairs, and a logistic regression model is trained to predict which pairs refer
to the same real-world entity and which do not. Finally, hierarchical clustering is applied to
perform the actual deduplication.
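
The four steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not DedupliPy's implementation or API: the blocking rule (first letter of the last word), the use of `difflib` as the only similarity metric, the hand-rolled one-feature logistic regression, and the hard-coded labels are all hypothetical. The final step links predicted matches with union-find, which yields the same groups as cutting single-linkage hierarchical clustering at the chosen threshold; the linkage method DedupliPy actually uses is not stated here.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
import math

records = ["John Smith", "Jon Smith", "Sarah Stone",
           "Mary Jones", "Marie Jones", "Peter Jackson"]


def candidate_pairs(rows):
    """Blocking: only pair records that share a (hypothetical) blocking key."""
    blocks = defaultdict(list)
    for idx, row in enumerate(rows):
        key = row.split()[-1][0].lower()  # first letter of the last word
        blocks[key].append(idx)
    return sorted(p for members in blocks.values() for p in combinations(members, 2))


def similarity(a, b):
    """String similarity in [0, 1]; difflib stands in for richer metrics."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, lr=1.0, steps=5000):
    """One-feature logistic regression fitted with batch gradient descent."""
    w = b = 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        w += lr * sum(err * x for err, x in zip(errors, xs)) / n
        b += lr * sum(errors) / n
    return w, b


def cluster(n, matches):
    """Group record indices connected by predicted matches (union-find)."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in matches:
        parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Step 1: blocking reduces 15 possible pairs to 6 candidates.
pairs = candidate_pairs(records)

# Step 2: one similarity feature per candidate pair.
features = {p: similarity(records[p[0]], records[p[1]]) for p in pairs}

# Step 3: hand-written labels stand in for answers given during active learning.
labels = {(0, 1): 1, (0, 2): 0, (3, 4): 1, (3, 5): 0}
w, b = fit_logistic([features[p] for p in labels], list(labels.values()))
scores = {p: sigmoid(w * features[p] + b) for p in pairs}

# Step 4: records linked by a predicted match end up in the same cluster.
clusters = cluster(len(records), [p for p in pairs if scores[p] >= 0.5])
print(clusters)
```

Each inner list in `clusters` holds the indices of records treated as one entity. Because only pairs inside a block are ever scored, records in different blocks can never be merged, which is the price paid for avoiding the full pairwise comparison.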