New system automatically cleans up messy data tables | MIT News



MIT researchers have created a new system that automatically cleans “dirty data”: the typos, duplicates, missing values, misspellings, and inconsistencies dreaded by data analysts, data engineers, and data scientists. The system, called PClean, is the latest in a series of domain-specific probabilistic programming languages written by researchers at the Probabilistic Computing Project that aim to simplify and automate the development of AI applications (others include one for 3D perception via inverse graphics and another for modeling time series and databases).

According to surveys conducted by Anaconda and Figure Eight, data cleaning can take a quarter of a data scientist’s time. Automating the task is difficult because different datasets require different types of cleaning, and common-sense judgment calls about objects in the world are often required (for example, deciding which of several cities called “Beverly Hills” a person lives in). PClean provides generic common-sense models for these kinds of judgment calls that can be customized for specific databases and types of errors.

PClean uses a knowledge-based approach to automate the data cleaning process: users encode background knowledge about the database and the types of issues likely to arise. Take, for example, the problem of cleaning up state names in a database of apartment listings. What if someone says they live in Beverly Hills but leaves the state column empty? While there is a well-known Beverly Hills in California, there are also ones in Florida, Missouri, and Texas, and there is a neighborhood in Baltimore known as Beverly Hills. How do you know which one the person lives in? This is where PClean’s expressive scripting language comes in. Users can give PClean background knowledge about the domain and about how data can get corrupted. PClean combines this knowledge through common-sense probabilistic reasoning to find the answer. For example, given additional knowledge about typical rents, PClean infers that the correct Beverly Hills is the one in California because of the high cost of rent where the respondent lives.
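The Beverly Hills judgment call can be sketched as a small Bayesian calculation. The Python below is only an illustration of the reasoning, not PClean’s scripting language: the prior weights, the typical rents, the normal rent model, and the names `CANDIDATES` and `posterior_over_states` are all assumptions invented for this example.

```python
import math

# Hypothetical background knowledge: how plausible each "Beverly Hills" is
# a priori, and a typical monthly rent (mean, standard deviation) in dollars.
# Every number here is invented for illustration.
CANDIDATES = {
    "CA": {"prior": 0.40, "rent_mean": 4500.0, "rent_sd": 1500.0},
    "FL": {"prior": 0.15, "rent_mean": 1200.0, "rent_sd": 400.0},
    "MO": {"prior": 0.15, "rent_mean": 900.0, "rent_sd": 300.0},
    "TX": {"prior": 0.15, "rent_mean": 1000.0, "rent_sd": 350.0},
    "MD": {"prior": 0.15, "rent_mean": 1400.0, "rent_sd": 450.0},
}


def normal_pdf(x, mean, sd):
    """Density of a normal distribution, used as a simple rent model."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_over_states(observed_rent):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalized = {
        state: info["prior"]
        * normal_pdf(observed_rent, info["rent_mean"], info["rent_sd"])
        for state, info in CANDIDATES.items()
    }
    total = sum(unnormalized.values())
    return {state: weight / total for state, weight in unnormalized.items()}


if __name__ == "__main__":
    # A listing in "Beverly Hills" with a blank state and a high rent.
    for state, prob in sorted(
        posterior_over_states(4200.0).items(), key=lambda kv: -kv[1]
    ):
        print(f"{state}: {prob:.4f}")
```

With these made-up numbers, a high rent pushes nearly all of the probability onto California, while a low rent would favor one of the other candidates.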

Alex Lew, the lead author of the paper and a doctoral student in the Department of Electrical Engineering and Computer Science (EECS), says he is most excited that PClean gives a way to get help from computers in the same way that people seek help from one another. “When I ask a friend for help with something, it’s often easier than asking a computer. This is because in today’s dominant programming languages I have to give step-by-step instructions, which cannot assume that the computer has any context about the world or the task, or even just common-sense reasoning skills. With a human, I get to assume all of these things,” he says. “PClean is a step towards closing this gap. It lets me tell the computer what I know about a problem, encoding the same kind of background knowledge that I would explain to a person helping me clean my data. I can also give PClean hints, tips, and tricks that I have already discovered for solving the task faster.”

The co-authors are Monica Agrawal, a doctoral student in EECS; David Sontag, an associate professor in EECS; and Vikash K. Mansinghka, a senior researcher in the Department of Brain and Cognitive Sciences.

What innovations make this work?

The idea that probabilistic cleaning based on declarative, generative knowledge could potentially provide much greater accuracy than machine learning was previously suggested in a 2003 paper by Hanna Pasula and others from Stuart Russell’s lab at the University of California, Berkeley. “Ensuring data quality is a huge problem in the real world, and almost all existing solutions are ad hoc, expensive, and error-prone,” says Russell, professor of computer science at UC Berkeley. “PClean is the first scalable, well-engineered, general-purpose solution based on generative data modeling, which has to be the right way to go. The results speak for themselves.” Co-author Agrawal adds that “existing data cleaning methods are more constrained in their expressiveness, which can be more user-friendly, but at the expense of being quite limiting. We have found that PClean can scale to very large datasets that have unrealistic run times under existing systems.”

PClean builds on recent advances in probabilistic programming, including a new AI programming model built at the MIT Probabilistic Computing Project that makes it much easier to apply realistic models of human knowledge to interpret data. PClean’s repairs are based on Bayesian reasoning, an approach that weighs alternative explanations of ambiguous data by applying probabilities based on prior knowledge to the data at hand. “The ability to make these kinds of uncertain decisions, where we want to tell the computer what sort of things it is likely to see, and have the computer automatically use that to figure out what is probably the correct answer, is central to probabilistic programming,” explains Lew.

PClean is the first Bayesian data cleaning system that can combine domain expertise with common-sense reasoning to automatically clean databases of millions of records. PClean achieves this scale thanks to three innovations. First, PClean’s scripting language lets users encode what they know. This yields accurate models, even for complex databases. Second, PClean’s inference algorithm uses a two-phase approach: it processes records one at a time to make informed guesses about how to clean them, then revisits its judgment calls to fix mistakes. This yields robust, accurate inference results. Third, PClean provides a custom compiler that generates fast inference code. This allows PClean to run on databases containing millions of records faster than several competing approaches. “PClean users can give PClean hints about how to reason more effectively about their database and how to tune its performance, unlike previous probabilistic programming approaches to data cleaning, which relied primarily on generic inference algorithms that were often too slow or inaccurate,” says Mansinghka.
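The two-phase idea can be illustrated with a toy cleaner for a single text column. This is not PClean’s inference algorithm, only a sketch of the “guess record by record, then revisit” shape under assumed rules: a clean value is scored by how often it is already in use, minus a hypothetical `TYPO_PENALTY` per character edit. All names below are invented for the example.

```python
import math
from collections import Counter

TYPO_PENALTY = 1.5  # hypothetical log-probability cost per character edit


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb))
            )
        prev = curr
    return prev[-1]


def best_clean_value(observed, counts):
    """Pick the clean value that best explains one observed string."""
    candidates = set(counts) | {observed}

    def score(candidate):
        popularity = math.log(counts[candidate] + 0.5)
        return popularity - TYPO_PENALTY * edit_distance(observed, candidate)

    # Sorting first makes tie-breaking deterministic.
    return max(sorted(candidates), key=score)


def clean_two_phase(records):
    """Return (first-pass guesses, revised guesses) for a list of strings."""
    counts = Counter()
    guesses = []
    # Phase 1: one record at a time, using only what has been seen so far.
    for observed in records:
        guess = best_clean_value(observed, counts)
        counts[guess] += 1
        guesses.append(guess)
    first_pass = list(guesses)
    # Phase 2: revisit each guess now that the whole column is known.
    for i, observed in enumerate(records):
        counts[guesses[i]] -= 1
        if counts[guesses[i]] == 0:
            del counts[guesses[i]]
        guesses[i] = best_clean_value(observed, counts)
        counts[guesses[i]] += 1
    return first_pass, guesses


if __name__ == "__main__":
    column = ["Bostn", "Boston", "Boston", "Boston", "Boston", "Austin"]
    first, revised = clean_two_phase(column)
    print(first)
    print(revised)
```

In this example the misspelled first record is kept as is during the first pass, because nothing similar has been seen yet, and is corrected only when the second pass revisits it with the full column in view.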

As with all probabilistic programs, far fewer lines of code are needed to run the tool than with state-of-the-art alternatives: PClean programs need only around 50 lines of code to outperform benchmarks in terms of accuracy and run time. For comparison, a simple snake cellphone game takes twice as many lines of code to run, and Minecraft has well over a million lines of code.

In their paper, which was just presented at the 2021 Society for Artificial Intelligence and Statistics conference, the authors show PClean’s ability to scale to datasets containing millions of records by using it to detect errors and impute missing values in the 2.2 million-row Medicare Physician Compare National database. Running for only seven and a half hours, PClean detected over 8,000 errors. The authors then checked by hand (by searching hospital websites and physician LinkedIn pages) that for over 96% of them the fix proposed by PClean was correct.

Since PClean is based on Bayesian probability, it can also give calibrated estimates of its uncertainty. “It can maintain multiple hypotheses and give you graded judgments, not just yes/no answers. This builds trust and helps users override PClean when necessary. For example, you can look at a judgment where PClean was uncertain and tell it the correct answer. It can then update the rest of its judgments in light of your feedback,” Mansinghka says. “We think there is a lot of potential value in this kind of interactive process that interleaves human judgment and machine judgment. We see PClean as one of the first examples of a new type of AI system that can be told more of what people know, report when it is in doubt, and reason and interact with people in more useful, more human-like ways.”
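One way to picture graded judgments and user overrides is a simple triage step over per-cell probabilities. The sketch below is an assumption-laden illustration, not PClean’s interface: the `triage` function, the 0.9 threshold, and the example probabilities are all hypothetical, and it does not model how a correction would propagate to other judgments.

```python
def triage(posteriors, threshold=0.9):
    """Split proposed repairs into auto-accepted ones and a review queue.

    `posteriors` maps a cell name to a dict of candidate value -> probability.
    Cells whose best candidate falls below `threshold` are queued for a human,
    least certain first.
    """
    accepted = {}
    review = []
    for cell, distribution in posteriors.items():
        value, prob = max(distribution.items(), key=lambda kv: kv[1])
        if prob >= threshold:
            accepted[cell] = value
        else:
            review.append((cell, value, prob))
    review.sort(key=lambda item: item[2])
    return accepted, review


if __name__ == "__main__":
    # Invented probabilities for two cells with a missing state.
    example = {
        "row1.state": {"CA": 0.98, "TX": 0.02},
        "row2.state": {"FL": 0.55, "MO": 0.45},
    }
    accepted, review = triage(example)
    print("accepted:", accepted)
    print("needs review:", review)
    # A user override simply replaces the uncertain guess.
    accepted["row2.state"] = "MO"
    print("after override:", accepted)
```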

David Pfau, Principal Investigator at DeepMind, noted in a tweet that PClean responds to a business need: “When you consider that the vast majority of business data out there is not images of dogs, but entries in relational databases and spreadsheets, it’s a wonder that things like this do not yet have the success that deep learning has.”

Benefits, risks and regulations

With PClean, it is cheaper and easier to join messy and inconsistent databases into clean records, without the massive investments in human and software systems that data-centric businesses rely on today. This has potential social benefits, but also risks, among them that PClean may make it cheaper and easier to invade people’s privacy, and potentially even to de-anonymize them, by joining incomplete information from several public sources.

“We ultimately need much stronger data, artificial intelligence, and privacy regulations to mitigate these kinds of harms,” Mansinghka says. Lew adds: “Compared to machine learning approaches to data cleaning, PClean could allow finer-grained regulatory control. For example, PClean can tell us not only that it has merged two records as referring to the same person, but also why it did so, and I can come to my own judgment as to whether I agree. I can even tell PClean to consider only certain reasons for merging two entries.” Unfortunately, the researchers say, privacy concerns persist no matter how well a dataset is cleaned.

Mansinghka and Lew are excited to help people pursue socially beneficial applications. They have been approached by people interested in using PClean to improve data quality for journalism and humanitarian applications, such as anti-corruption monitoring and the consolidation of donor records submitted to state boards of elections. Agrawal says she hopes PClean will free up data scientists’ time “to focus on the problems they care about rather than cleaning data. Early feedback and enthusiasm around PClean suggest that this might be the case, which we’re excited to hear.”
