ActiveClean: a tool that uses machine learning to clean dirty data in big data sets

AMPLab researchers Sanjay Krishnan, Prof. Michael Franklin, Prof. Ken Goldberg, Eugene Wu, and Jiannan Wang have developed ActiveClean, a system that uses machine learning to make cleaning dirty data more efficient. It analyzes a user's prediction model to decide which errors to fix first, and it updates the model as the cleaning proceeds. The demonstration paper, titled "ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning," received the Best Demo Award at SIGMOD 2016.

ActiveClean is profiled in an I Programmer article, and the development team, led by Eugene Wu (now at Columbia), will present its research on Sept. 7 in New Delhi at the 2016 conference on Very Large Data Bases (VLDB).