Hardest Part Of AI Is Cleaning Up Your Data - Tips From Experts

By Christopher Steiner  •  Jun 29, 2017

As more tools become available to create AI models, it has become easier for companies to harness the power of machine learning for their applications. What once required deep domain expertise to execute has been made easier by libraries and frameworks, such as Google's TensorFlow.

To be clear, none of it is 'easy,' but it may well be that the hardest part of the AI equation is acquiring, wrangling and, perhaps most painfully, cleaning the data required to do the job. Engineers without experience in AI may well underestimate the time and effort required to get data to a point where AI will make the greatest impact, where the model will be as powerful and predictive as it can be.

We talked to many data scientists and engineers who estimated that, on a given AI project, corralling, moving (these datasets can be unwieldy in their size), checking and organizing the data often comprises 70% to 80% of the time spent on a project. Setting up the model and building it can often form the shorter backend of a job.

With that in mind, we used this insight from experienced AI hands to put together tips and tactics on moving, cleaning and preparing data.

In short form (more details below), our findings:

  • When the data is impossibly large, sometimes it's best to move the algorithms, not the data
  • Even companies with massive, clean proprietary data stores will need to spend time massaging it for AI
  • Dark and unstructured data shouldn't be ignored
  • It's best if you actually look at your data, the earlier the better
  • Automate inspection - even include AI in the process 
  • Embrace automation, but don't automatically dismiss blanks and null values
  • Use other AI models to crawl data as it comes in
  • Examine data for bias, expunge it and keep it out of the model

Moving data

Transporting big sets of data for building complicated machine learning models can require moving the data offline, in physical form. The Web, in some cases, just doesn't offer enough speed. A terabyte or two can be moved easily enough across normal channels, of course, but things can get clunky when the data involved reaches multiple petabytes or even exabytes. At this point the data will likely need to be transported physically. 

Think through how to best facilitate the movement of the data while also ensuring its redundancy and the ability to process it. 

While intuition might suggest that more data is always better, processing such vast amounts can prove difficult not only because of the physical location of the data, but also because of the time and processing power required.

Using the web for processing power or for transport can get spendy, as cloud solutions can be prohibitively expensive for building AI models, something I wrote about recently.

Sometimes it's best not to think about moving the data, but about moving the method: the machines and processing units that parse the data and build the model.

"Advanced systems typically move algorithms to where the data is, rather than moving the data to the algorithms," points out Siddhartha Agarwal, Vice President of Product Management & Strategy at Oracle. 

Engineers should still seek to use as much data as possible. AI algorithms are good at finding what is relevant and what is not, so it doesn't hurt to err on the side of feeding more data to these algorithms.

Serendipitous discoveries of unknown correlations are more likely to occur when a model is built with more data rather than less, Agarwal says.

Even companies with massive proprietary data stores will need to spend time massaging it for AI

Most applications and databases in use today, especially those that are chock full of records, weren't built with the specter of AI looming as it does now. So it's often the case that applications are allowed to write non-standard data to their databases, which may be rife with blanks, misspellings, and non-standard entries. This doesn't pose much of a problem when the data only has to function as expected in a relational database, where it's usually queried in small groups or by itself. 

But when it has to be drawn out and dumped into a new place, with new expectations on it, engineers may find out that their data asset, which was assumed to be so valuable, isn't quite AI-ready.

One common problem for big companies with piles of data is that they have it stored in silos, with little connective tissue between the databases.

"Many companies use a separate CRM platform, a customer service platform, and an email campaign management platform," points out Chris Matty, the CEO and co-founder of Versium, which uses AI to do predictive analytics. "While all of these platforms are beneficial to the business, they are disparate systems, which can cause several issues from an analytical perspective." 

Data silos may result in duplicate information, some of which may agree and some of which may contradict. Data silos can also limit a company's ability to derive quick insights from its internal data.

This common circumstance leads companies and data scientists to spend far more time merging datasets and getting all of the fields to agree with each other. This can be accomplished with scripts that build bigger, unified tables and encode assumptions, but writing those scripts takes time and consideration from those who know the data best.
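As a minimal sketch of what that merging work can look like, the snippet below joins three hypothetical silo exports (CRM, customer service, email campaigns) on a shared, normalized email field. The filenames and column names are assumptions for illustration, not a real schema.

```python
# Sketch: merging records from three hypothetical silo exports on a shared
# email field. Filenames and columns are illustrative assumptions.
import pandas as pd

crm = pd.read_csv("crm_export.csv")             # e.g. email, name, account_value
service = pd.read_csv("service_export.csv")     # e.g. email, ticket_count
campaigns = pd.read_csv("campaign_export.csv")  # e.g. email, last_open_date

# Normalize the join key so trivially different entries still match.
for df in (crm, service, campaigns):
    df["email"] = df["email"].str.strip().str.lower()

# Outer joins keep customers that appear in only one silo.
merged = (
    crm.merge(service, on="email", how="outer")
       .merge(campaigns, on="email", how="outer")
       .drop_duplicates(subset="email")
)

merged.to_csv("unified_customers.csv", index=False)
```

The hard part isn't the join itself; it's deciding, with the people who know each system, which field wins when the silos disagree.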

Rushing into AI without realizing the constraints that previous data practices and collection have placed on a startup or company will lead to projects that balloon in cost and time—so be sure to budget for these kinds of issues, especially with legacy datasets. 

Just as important, keep all of these things in mind when building new applications and new data structures. Ensuring clean data now will pave the way for easier AI analysis later.

Dark and unstructured data shouldn't be ignored

Data is available in many forms and shapes. Gabriel Moreira, the lead data scientist at CI&T, a digital agency, says that 80% of organizational data is unstructured: logs, documents, images and other media types. This ‘dark’ data is harder to analyze than structured data because it lacks the level of organization analysis requires, and some of it may not be stored in traditional, labeled databases.

But just because it's harder to analyze does not imply it's useless data. 

"There are usually many hidden opportunities in the haystack," Moreira says.

For example, web server logs may be used to understand users’ journeys across a website, to model user preferences and even to provide personalized recommendations. Scanned document images can be digitized with OCR, and Natural Language Processing techniques can then provide a big picture of the processes that produced those documents. Call center recordings could be transcribed to text to analyze the main motivations for the calls and the tone of the conversations. Webcams in stores might be used to assess customers' satisfaction as they browse, and airport cameras may be used to automatically detect suspicious behavior.
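As a rough illustration of mining one kind of dark data, here's a small sketch that groups web server access-log lines into per-visitor page sequences. The Common Log Format regex and the choice of client IP as the grouping key are simplifying assumptions; a production pipeline would usually key on a session identifier.

```python
# Sketch: reconstructing per-visitor page sequences from an access log in
# Common Log Format. The grouping key (client IP) is a simplifying assumption.
import re
from collections import defaultdict

LOG_LINE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST) (\S+)')

journeys = defaultdict(list)
with open("access.log") as f:
    for line in f:
        match = LOG_LINE.match(line)
        if match:
            ip, timestamp, path = match.groups()
            journeys[ip].append((timestamp, path))

# Each value is now an ordered list of (timestamp, path) pairs that could
# feed a preference model or a recommender.
for ip, pages in list(journeys.items())[:5]:
    print(ip, [path for _, path in pages])
```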

Leveraging this kind of data requires extra care and time, and special processes to ensure that data is straight and clean. The process to translate, parse and organize disparate data types for a task like this will likely comprise much of the job for an experienced AI model builder, but the payoff is worth it.

Putting together structured and unstructured data can lead to more powerful models with higher degrees of accuracy and usefulness. Building models in this way requires more diligence, more steps and more time. But it's these kinds of constructs that can lend one company a proprietary AI edge that can be difficult—and not intuitive—for other companies to match. 

Look at your data

Organizing and cleaning up data is the least glamorous part of the AI mission. But it must be done.

A good place to start, recommends Amanda Stent, natural language processing architect at Bloomberg, is actually looking at the data, or at least some of it. 

Stent had a task earlier in her career that involved identifying the temporal ordering of events (i.e., whether Event A occurred before, after or during Event B). A dataset was provided, but Stent's team couldn't establish an obvious baseline for the task using that data.

After a couple of weeks, she finally examined the data and discovered that the logical completion of the temporal links had not been made in the evaluation or training data: if Event A occurred before Event B, the data was not also labeled to show that Event B came after Event A, so there was little chance the model would ever find that relationship.

"Two weeks was entirely too long to go without looking at the data," Stent says. "Make sure to look at your data early on, before spending weeks driving yourself crazy with model and feature engineering."

Cleaning the data may also mean adding to it. In the project Stent mentioned, some programming was all that was required to clean the data, adding labels where necessary. She's worked with other datasets, however, where the fix wasn't so easy, where missing fields required engineers to interpolate values and fill in the gaps. Some models can deal with blanks better than others, but it's best to aim for completeness and quality.
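A sketch of the kind of programmatic fix Stent describes, completing the logical inverses of labeled temporal links, might look something like this. The tuple representation and relation names are illustrative assumptions, not the actual dataset's format.

```python
# Sketch: adding the inverse of every labeled temporal link so the model can
# see both directions of each relation. Relation names are assumptions.
INVERSE = {"BEFORE": "AFTER", "AFTER": "BEFORE", "SIMULTANEOUS": "SIMULTANEOUS"}

def complete_links(links):
    """Return the link set plus the logical inverse of every labeled link."""
    completed = set(links)
    for event_a, relation, event_b in links:
        inverse = INVERSE.get(relation)
        if inverse:
            completed.add((event_b, inverse, event_a))
    return completed

links = {("EventA", "BEFORE", "EventB")}
print(complete_links(links))
# {('EventA', 'BEFORE', 'EventB'), ('EventB', 'AFTER', 'EventA')}
```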

"Garbage in, garbage out," Stent says.

Automate inspection - even include AI in the process 

Where possible, engineers should write scripts that check whether the data falls within the required specs. This means ensuring that things like dates, times and ZIP codes conform to standard conventions.

It's best if these scripted functions have some flexibility to them, so that they can be easily adjusted for each dataset. Building them this way allows engineers to reuse the scripts and to assemble a library of methods that allows for faster processing of data and, ultimately, better AI models built with data that's cleaner and more relevant.
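A minimal sketch of such a reusable check library might look like the following; the field names, formats and example rows are assumptions to be adjusted per dataset.

```python
# Sketch: small, reusable validators that flag values falling outside the
# expected specs. Field names and formats are per-dataset assumptions.
import re
from datetime import datetime

def valid_date(value, fmt="%Y-%m-%d"):
    try:
        datetime.strptime(value, fmt)
        return True
    except (ValueError, TypeError):
        return False

def valid_zip(value):
    return bool(re.fullmatch(r"\d{5}(-\d{4})?", str(value)))

CHECKS = {"signup_date": valid_date, "zip": valid_zip}

def audit(rows, checks=CHECKS):
    """Yield (row_index, field) for every value that fails its check."""
    for i, row in enumerate(rows):
        for field, check in checks.items():
            if field in row and not check(row[field]):
                yield i, field

rows = [{"signup_date": "2017-06-29", "zip": "60614"},
        {"signup_date": "29/06/2017", "zip": "ABCDE"}]
print(list(audit(rows)))   # [(1, 'signup_date'), (1, 'zip')]
```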

It's during this step that previously built AI models can help. In effect, engineers will train AI to process the data that will then, in turn, train AI. Building the first models and methods will be tedious, but they can be used over and over again, and will help the eventual models be far more precise.

"Data quality at scale is an excellent application for AI, and probably the only way to move the bar significantly forward there," says Massimo Mascaro, director, data engineering and data science at Intuit, the maker of Quick Books and Turbo Tax.

Intuit is using AI to look for anomalies and outliers in data and then flag those items for inspection. The next big step in that process will be automatic issue resolution, where the machine will automatically fix or decide to discard anomalous data. Eventually, Intuit wants to push AI out toward the UI of its products, so users can be prompted immediately if an entry (usually tax or income data, which can be painful to get wrong) doesn't seem correct. Getting AI to that position will inherently make Intuit's data cleaner and keep the IRS sated.
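Intuit hasn't published the details of its pipeline, but a generic version of flagging anomalous rows for human review might look like the sketch below, which uses scikit-learn's IsolationForest on a set of invented numeric columns.

```python
# Generic sketch of flagging anomalous rows for manual review, in the spirit
# of the approach Mascaro describes (not Intuit's actual pipeline).
# Column names are invented; IsolationForest is one common outlier detector.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("transactions.csv")  # assumed to contain the columns below
features = df[["reported_income", "deduction_total", "num_dependents"]].fillna(0)

model = IsolationForest(contamination=0.01, random_state=0)
df["flagged"] = model.fit_predict(features) == -1   # -1 marks outliers

# Route flagged rows to human inspection rather than silently dropping them.
df[df["flagged"]].to_csv("rows_for_review.csv", index=False)
```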

Beyond automated methods, which should be in every AI engineer's toolbox, those creating AI models can tap services from the rising class of training-data-as-a-service (TDaaS) firms, which offer a way to get outsourced human eyeballs to examine data, and even to help create it, at an affordable price.

Learning what these firms are good at and what they're not good at is a nuanced process that will require trial and error, but it will reward those who do it, as they'll cultivate a process with cheaper human inputs, which are often required when getting data into the fold.

Embrace automation, but don't automatically dismiss blanks

Sometimes a blank can signify carelessness or an error, but blanks, given context, can also signal that the user was indicating something else. That's an important distinction, one that has to be considered dataset to dataset. 

In the case of employment dates, a blank end date on a person's position at a company can often mean they're still in that job, points out Mark Goldin, the CTO of Cornerstone OnDemand, a cloud platform for recruiting and managing employees. 

A blank class description, however, is simply missing data. "Depending on the application, we can either still use the data without that particular value, throw out rows with bad data or assume some sort of average or predicted data instead," Goldin says.
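A small sketch of that kind of context-aware handling of blanks, under assumed column names and file layouts, might look like this:

```python
# Sketch: treating blanks differently depending on what they mean, along the
# lines Goldin describes. Filenames, columns and fill strategies are assumptions.
import pandas as pd

positions = pd.read_csv("positions.csv", parse_dates=["start_date", "end_date"])

# A blank end_date typically means "still in the job", so encode that
# explicitly instead of treating it as bad data.
positions["is_current"] = positions["end_date"].isna()

classes = pd.read_csv("classes.csv")

# A blank class description really is missing data: drop the row, keep the
# row but ignore that field, or substitute a predicted value.
classes_clean = classes.dropna(subset=["description"])
```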

Goldin's company has built a custom inspection tool it calls Datascope to determine the quality of a given dataset. It produces a red, yellow or green grade for data quality and quantity for every customer and every AI data source they consider.

Christopher Steiner is a New York Times Bestselling Author of two books, the founder of ZRankings, and the co-founder of Aisle50 (YCS11), acquired by Groupon in 2015.