DedupliPy

Deduplication is the task of combining different representations of the same real-world entity. This Python package implements deduplication using active learning, which allows for rapid training without a large, manually labelled dataset.

DedupliPy is an end-to-end solution with advantages over existing solutions:

active learning; no large manually labelled dataset required
during active learning, the user is notified when the model has converged and training can be stopped
works out of the box; advanced users can adjust settings as desired (custom blocking rules, custom metrics, interaction features)
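To make the active learning idea concrete, here is a minimal sketch in plain Python. It is not DedupliPy's own code: the function names `pick_query` and `has_converged`, the `window` and `tolerance` parameters, and the stopping rule itself are assumptions made for illustration. The sketch picks the pair the model is least sure about (predicted probability closest to 0.5) and treats training as converged once a model score stops changing.

```python
def pick_query(match_probabilities):
    """Return the index of the pair whose predicted match probability
    is closest to 0.5, i.e. the pair the model is least certain about."""
    return min(
        range(len(match_probabilities)),
        key=lambda i: abs(match_probabilities[i] - 0.5),
    )


def has_converged(score_history, window=3, tolerance=0.01):
    """Hypothetical stopping rule (an assumption, not DedupliPy's rule):
    the last `window` scores differ by less than `tolerance`."""
    if len(score_history) < window:
        return False
    recent = score_history[-window:]
    return max(recent) - min(recent) < tolerance


# Made-up predicted probabilities for four candidate pairs.
probabilities = [0.97, 0.08, 0.55, 0.31]
query_index = pick_query(probabilities)  # pair to show to the user next

# Made-up model scores recorded after each labelling round.
scores = [0.71, 0.84, 0.902, 0.905, 0.904]
converged = has_converged(scores)
```

With these made-up numbers, the third pair (index 2) would be presented for labelling, and the score history counts as converged because the last three scores lie within 0.01 of each other.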

DedupliPy is developed using modAL, Scikit-Learn and SciPy.

How does it work?

Naively comparing all pairs in a dataset scales quadratically: n records yield n(n-1)/2 pairs to check. DedupliPy therefore first creates blocks of pairs in which duplicates are likely to be found. Next, string similarity metrics are applied to the pairs, and a logistic regression model is trained to predict which pairs belong to the same real-world entity and which do not. Finally, hierarchical clustering is applied to the predicted matches to perform the actual deduplication.
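The steps above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not DedupliPy's implementation or API: the sample records, the `block_key` rule, and the `THRESHOLD` value are invented, a fixed similarity threshold stands in for the trained logistic regression model, and a connected-components grouping stands in for hierarchical clustering.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Invented sample data for illustration.
records = [
    "Jon Smith, Amsterdam",
    "John Smith, Amsterdam",
    "Jane Doe, Utrecht",
    "Jane  Doe, Utrecht",
    "Joan Smit, Rotterdam",
]


def block_key(record):
    # Hypothetical blocking rule: first letter of the name plus the city.
    name, city = record.split(",")
    return (name.strip()[0].lower(), city.strip().lower())


def candidate_pairs(records):
    # Step 1: only records that share a block are compared.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return sorted(pairs)


def similarity(a, b):
    # Step 2: one string similarity metric, a value in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(n, matched_pairs):
    # Step 3 (stand-in): group records connected by matches (union-find).
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in matched_pairs:
        parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


THRESHOLD = 0.85  # assumed cut-off, replacing the trained classifier

pairs = candidate_pairs(records)
matches = [(i, j) for i, j in pairs
           if similarity(records[i], records[j]) >= THRESHOLD]
clusters = cluster(len(records), matches)
naive_pair_count = len(records) * (len(records) - 1) // 2
```

On this toy data, blocking reduces the 10 naive pairs to 2 candidate pairs, both of which pass the threshold, so the five records collapse into three entities. In the real package the match decision comes from a model trained on the user's labels rather than from a hand-picked threshold.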


Presented at PyData Global 2021