The Dirty Data Paradox

Malay Shah

November 26, 2024

The Dirty Data Paradox: Why So Little Data Is Actually Used in Chemical Plants

Where does dirty data come from? If you’ve ever stepped inside a chemical plant, you’ve likely encountered it firsthand. Maybe it’s a broken sensor delivering erratic readings, or a spreadsheet with a mysterious error buried in a tab labeled “1165,” created by a supply chain planner who left the company six months ago.

Dirty data isn’t just inconvenient—it’s pervasive. And it explains why data practitioners famously spend 80% of their time cleaning and organizing data, a challenge that’s far more nuanced than it seems.

The Data Exhaust Problem

To put it bluntly, much of the data generated in process plants is useless. Or rather, it could be useful, but no one has the bandwidth to figure out how—not when they’re juggling 85 other priorities.

This issue goes beyond broken sensors and incomplete logs. Consider emails: nearly everyone in a manufacturing plant sends and receives them, yet few organizations even attempt to track “email data” for critical workflows like production planning and quality issues.

The reality is that only a handful of operational variables matter at any given time. The rest? It’s data exhaust.

Data exhaust—the byproduct of modern software and data collection practices—grows exponentially and forms the crux of the dirty data problem. Worse yet, there’s little incentive to fix it.

Why Dirty Data Persists

It’s estimated that 60% of bad data stems from human error. But here’s the kicker: if much of that data is exhaust, why would anyone prioritize cleaning it? For plant workers, spending time correcting or managing “seemingly useless” data simply doesn’t make sense when there are fires to put out—both figuratively and literally. Who cares what the monthly cycle time of Product ABC is when customer XYZ is complaining about a delivery issue that I need to drop everything to solve?

As a result, data exhaust and dirty data pile up, leaving organizations to grapple with a mess of low-quality inputs when they finally decide to optimize or innovate.

Can AI Agents Clean Up the Mess?

The road to meaningful AI-driven productivity in manufacturing starts with clean, reliable data. The challenge lies in cleaning and parsing data automatically, without pulling employees away from their primary tasks.

This is where recent innovations in AI agents could shine. These tools can parse and contextually understand unstructured or semi-structured data—emails, PDFs, spreadsheets, and even handwritten notes—transforming “data exhaust” into actionable insights.

The beauty of AI agents is their ability to bridge the gap between messy inputs and useful outputs. By automating data extraction, cleaning, and categorization, they can provide structured, valuable datasets for other software tools to leverage.
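To make that concrete, here is a minimal sketch of the extraction step, applied to the kind of quality-issue email described above. The `QualityIssueRecord` fields, the `extract_record` helper, and the rule-based parsing are all hypothetical illustrations; a real AI agent would use an LLM rather than regexes, but the goal is the same: reduce a messy email to a structured record that downstream tools can consume.

```python
# Hypothetical sketch: turn an unstructured quality-issue email into a
# structured record. A production AI agent would use an LLM for the
# extraction step; simple regex rules stand in for it here.
import json
import re
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class QualityIssueRecord:
    product: Optional[str]     # e.g. "ABC-200"
    batch: Optional[str]       # e.g. "1165"
    issue_type: Optional[str]  # e.g. "off-spec viscosity"
    customer: Optional[str]    # e.g. "XYZ"


def extract_record(email_body: str) -> QualityIssueRecord:
    """Pull structured fields out of free-form email text."""

    def find(pattern: str) -> Optional[str]:
        match = re.search(pattern, email_body, flags=re.IGNORECASE)
        return match.group(1).strip() if match else None

    return QualityIssueRecord(
        product=find(r"product[:\s]+([\w\-]+)"),
        batch=find(r"batch[:\s#]+(\d+)"),
        issue_type=find(r"issue[:\s]+([^.\n]+)"),
        customer=find(r"customer[:\s]+([\w\-]+)"),
    )


if __name__ == "__main__":
    email = (
        "Hi team, Customer XYZ flagged a problem again. "
        "Product: ABC-200, batch #1165. "
        "Issue: off-spec viscosity on the last shipment. Can someone look today?"
    )
    # The structured record can now feed scheduling or forecasting tools.
    print(json.dumps(asdict(extract_record(email)), indent=2))
```

The same pattern extends to PDFs, spreadsheet tabs, and handwritten notes once the parsing step is swapped for a model that can actually read them.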

Of course, humans still need to validate and refine these systems initially. But over time, AI agents can learn from human input, improving their accuracy and reducing the burden of manual data cleaning. We’re already seeing major benefits from this approach in areas like production scheduling and demand forecasting.

The Path Forward

Dirty data is an inevitable byproduct of modern chemical plants. However, we finally seem to have the technology to tackle the problem in a more automated way.

The paradox of dirty data is that it doesn’t need to stay dirty forever. It just needs a little help from an “AI Janitor” to get clean.

© 2024 Poka Labs, Inc. All rights reserved.