A data-cleaning tool for building better prediction models
by Staff Writers
New York NY (SPX) Sep 06, 2016

Tested on a dirty, real-world data set, ActiveClean (in red) needed to clean just 5,000 records to bring the researchers' prediction model to a 90 percent accuracy level. The next best technique, called active learning (in green), had to clean 50,000 records to achieve comparable results. The most common data-cleaning method - trial-and-error (in purple) - provided minimal model improvement. Image courtesy Eugene Wu.

Big data sets are full of dirty data, and these outliers, typos and missing values can produce distorted models that lead to wrong conclusions and bad decisions, be it in healthcare or finance. With so much at stake, data cleaning should be easier.

That's the inspiration for software developed by computer scientists at Columbia University and the University of California, Berkeley, that hands much of the dirty work over to machines. Called ActiveClean, the system analyzes a user's prediction model to decide which mistakes to edit first, while updating the model as it works. With each pass, users see their model improve.

"Dirty data is pervasive and prevents people from doing useful things," said Eugene Wu, a computer science professor at Columbia Engineering and a member of the Data Science Institute. "This is our first step towards automating the data-cleaning process."

The team will present its research on Sept. 7 in New Delhi, at the 2016 conference on Very Large Data Bases. Wu helped develop ActiveClean as a postdoctoral researcher at Berkeley's AMPLab and has continued this work at Columbia.

Big data sets are still mostly combined and edited manually, aided by data-cleaning software like Google Refine and Trifacta, or custom scripts developed for specific data-cleaning tasks. The process consumes up to 80 percent of analysts' time as they hunt for dirty data, clean it, retrain their model, and repeat the process. Cleaning is largely done by guesswork.

"Will it help or hurt the model? You have no idea," said Wu. "Data scientists either clean everything, which is impossible for huge datasets, or clean random subsets and hope for the best."

In the process, statistical biases can be introduced that skew models into producing misleading results. Those mistakes may not be caught until weeks later, as the researchers learned in an earlier survey of industry data scientists.

"Most of these errors are subtle enough that the analysis will go through," said one consultant from a large database vendor. "Usually it's only caught weeks later after someone notices something like, 'Well, the Wilmington branch cannot have $1 million sales in a week.'"

ActiveClean tries to minimize mistakes like these by taking humans out of the most error-prone steps of data cleaning: finding dirty data and updating the model. Using machine learning, the tool analyzes a model's structure to understand what sorts of errors will throw the model off most. It goes after those data first, in decreasing priority, and cleans just enough data to give users assurance that their model will be reasonably accurate.
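The prioritization idea described above can be illustrated with a minimal sketch. This is not the ActiveClean code: it assumes a logistic-regression model, scores each record by the size of its loss gradient as a stand-in for "errors that throw the model off most," and always takes the top-scoring records rather than sampling them. All function names and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_record_gradient_norms(weights, features, labels):
    """Norm of each record's logistic-loss gradient under the current model."""
    residuals = sigmoid(features @ weights) - labels  # shape (n,)
    return np.abs(residuals) * np.linalg.norm(features, axis=1)

def pick_records_to_clean(weights, features, labels, already_cleaned, batch_size):
    """Indices of the uncleaned records with the largest gradient norms."""
    scores = per_record_gradient_norms(weights, features, labels)
    scores[already_cleaned] = -np.inf  # never pick a record twice
    return np.argsort(-scores)[:batch_size]

def gradient_step(weights, features, labels, learning_rate=0.5):
    """One averaged gradient-descent update using only the given records."""
    residuals = sigmoid(features @ weights) - labels
    return weights - learning_rate * (features.T @ residuals) / len(labels)

# Toy data (assumption): 200 records, 40 of which have flipped labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
true_labels = (features[:, 0] > 0).astype(float)
labels = true_labels.copy()
labels[:40] = 1.0 - labels[:40]

weights = np.zeros(3)
cleaned = np.zeros(200, dtype=bool)
for _ in range(5):
    batch = pick_records_to_clean(weights, features, labels, cleaned, batch_size=20)
    labels[batch] = true_labels[batch]  # stand-in for a person or script fixing records
    cleaned[batch] = True
    weights = gradient_step(weights, features[cleaned], labels[cleaned])

print(f"records cleaned: {int(cleaned.sum())} of {len(labels)}")
```

The loop mirrors the cycle in the article: pick the records that matter most to the current model, clean them, update the model, and repeat until the cleaning budget is spent.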

The researchers tested ActiveClean on Dollars for Docs, a database of corporate donations to doctors that journalists at ProPublica compiled to analyze conflicts of interest and flag improper donations.

ActiveClean's results were compared against two baseline methods. One edited a subset of the data and retrained the model. The other used a popular prioritization algorithm called active learning that picks the most informative labels for ambiguous data. That algorithm improves the model without checking, as ActiveClean does, whether the labels are accurate.

Nearly a quarter of ProPublica's 240,000 records had multiple names for a drug or company. Left uncorrected, these inconsistencies could lead journalists to undercount donations by large companies, which were more likely to have such inconsistencies.
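A small sketch shows how inconsistent names produce an undercount. The company names, amounts, and the mapping of variants to a canonical name are all invented for illustration and do not come from the Dollars for Docs data.

```python
from collections import defaultdict

# Hypothetical records: (company name as entered, donation in dollars).
records = [
    ("Acme Pharma", 1000),
    ("ACME Pharma Inc.", 2500),
    ("Acme Pharma, Inc", 1500),
    ("Beta Labs", 3000),
]

# Hypothetical cleaning rule: map each variant spelling to one canonical name.
canonical_names = {
    "ACME Pharma Inc.": "Acme Pharma",
    "Acme Pharma, Inc": "Acme Pharma",
}

def total_by_company(rows, name_map=None):
    """Sum donations per company, optionally normalizing names first."""
    name_map = name_map or {}
    totals = defaultdict(int)
    for name, amount in rows:
        totals[name_map.get(name, name)] += amount
    return dict(totals)

dirty_totals = total_by_company(records)
clean_totals = total_by_company(records, canonical_names)

print(dirty_totals["Acme Pharma"])  # total credited before cleaning
print(clean_totals["Acme Pharma"])  # total credited after cleaning
```

Before cleaning, the largest donor's total is split across three spellings, so a smaller company with a single consistent name appears to have given more.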

With no data cleaning, a model trained on this dataset could predict an improper donation just 66 percent of the time. ActiveClean, they found, raised the detection rate to 90 percent by cleaning just 5,000 records. The active learning method, by contrast, required 10 times as much data, or 50,000 records, to reach a comparable detection rate.
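To put those figures in proportion, the arithmetic can be checked directly. The sketch assumes the experiment ran over the full 240,000-record dataset mentioned earlier, which the article does not state explicitly; the variable names are illustrative.

```python
total_records = 240_000          # dataset size reported in the article
activeclean_cleaned = 5_000      # records cleaned by ActiveClean
active_learning_cleaned = 50_000 # records cleaned by active learning

baseline_rate = 66  # percent detection with no cleaning
cleaned_rate = 90   # percent detection after cleaning

activeclean_share = 100 * activeclean_cleaned / total_records
active_learning_share = 100 * active_learning_cleaned / total_records
effort_ratio = active_learning_cleaned / activeclean_cleaned
gain_points = cleaned_rate - baseline_rate

print(f"ActiveClean cleaned {activeclean_share:.1f}% of records")
print(f"Active learning cleaned {active_learning_share:.1f}% of records")
print(f"Effort ratio: {effort_ratio:.0f}x, gain: {gain_points} percentage points")
```

Under that assumption, ActiveClean reached the 24-point gain after cleaning roughly 2 percent of the records, while active learning needed about 21 percent.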

"As datasets grow larger and more complex, it's becoming more and more difficult to properly clean the data," said study coauthor Sanjay Krishnan, a graduate student at UC Berkeley. "ActiveClean uses machine learning techniques to make data cleaning easier while guaranteeing you won't shoot yourself in the foot."

ActiveClean is a free, open-source tool released in August.



