The Unglamorous Thing That Makes AI-Driven Analytics Actually Work

There’s a lot of energy right now around what AI can do for analytics teams. Faster insights. Automated anomaly detection. Natural-language queries that replace hours of SQL. And it’s real. I’ve seen the workflows changing, and I find a lot of it genuinely exciting.

But there’s a question that isn’t getting asked loudly enough: what happens when the underlying data is a mess?

I’ve spent nearly two decades in this industry. I watched Universal Analytics age out, GA4 land with its own headaches, and now AI enter every conversation about where analytics is going. And the one constant through all of it is that most organizations are operating on data that isn’t clean, isn’t governed, and sometimes isn’t even internally consistent.

AI doesn’t fix this problem. It scales it.

The 80/20 Problem We Normalized

Before we can talk about AI honestly, we need to be honest about where most analytics teams are starting from.

The rough split in most organizations: around 80% of an analyst’s time goes to mechanical work. Pulling data, cleaning it, wrangling it, building the dashboard, answering the ad-hoc Slack. About 20% goes to actual thinking: diagnosis, strategy, recommendations. That inversion has always been a problem. We just normalized it because there wasn’t an obvious alternative.

The business is asking completely reasonable questions. “Why did users abandon the cart?” “What experience actually drove conversion?” “What should we change next?” These aren’t conceptually hard questions. The bridge to answering them at speed has always been the issue.

Add the metric traps we’ve built along the way. “Engagement is up.” “Conversion improved this quarter.” But engagement as a proxy for value? Clicks counted as customers? Pageviews celebrated as business success? These felt like answers. Often they were just the closest thing we could reach without harder work underneath.

This was the baseline: messy data, vague metrics, analysts buried in manual labor. Now AI has arrived and everyone wants to pour it on top of that baseline. And that’s where things get complicated.

What Happens When You Pour AI Into a Broken System

Here’s the part nobody talks about when AI gets announced as the future of analytics: language models have no immunity to bad data.

Let me give you a real example. When I was first at Google, I owned the analytics and A/B testing for the Google Apps for Business website, what’s now Google Workspace. My director of marketing came to me with what should have been a completely simple question: how many people came from House Ads (Google’s internal term for paid Google Search) and signed up for a trial of Google Apps for Business?

Now, this should have been straightforward. Go into Google Analytics, look at the UTM combination that pointed to House Ads, compare that to the event for Start Trial. Done.

Except it wasn’t done, because there were several different Start Trial buttons on the site, each built by a different developer at a different time. The event name for one had a capital S. One had a lowercase s. One was two words, one was one word, one had a capital T. Multiple events, one action. And on top of that, the UTM data was equally inconsistent. Some PMMs (product marketing managers) had used “HA” for House Ads, some spelled it out, some used “paid Google” or “Google CPC.” It was inconsistent to the point where answering this very reasonable business question required downloading a massive amount of data and manually adding together rows that may or may not have been the right combination of marketing channel and event.
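To make that manual clean-up concrete, here is a minimal Python sketch of the normalization an analyst ends up doing by hand. Everything in it is an assumption for illustration: the row layout, the field names (utm_source, event), the alias table, and the function names are hypothetical, not the actual Google Analytics export or the real labels beyond the ones mentioned above.

```python
import re
from collections import Counter

# Hypothetical alias table: maps the label variants mentioned above
# to one canonical channel name. Real data would need a longer list.
SOURCE_ALIASES = {
    "ha": "house_ads",
    "house ads": "house_ads",
    "paid google": "house_ads",
    "google cpc": "house_ads",
}


def normalize_event(name: str) -> str:
    """Lowercase and drop spaces, underscores, and hyphens so that
    'Start Trial', 'start trial', and 'StartTrial' collapse together."""
    return re.sub(r"[\s_\-]+", "", name.strip().lower())


def normalize_source(source: str) -> str:
    """Lowercase, unify separators, then look up a canonical channel."""
    key = re.sub(r"[\s_\-]+", " ", source.strip().lower())
    return SOURCE_ALIASES.get(key, key)


def count_house_ads_trials(rows) -> int:
    """Count rows whose normalized channel and event match the question."""
    counts = Counter(
        (normalize_source(r["utm_source"]), normalize_event(r["event"]))
        for r in rows
    )
    return counts[("house_ads", "starttrial")]


# Illustrative rows only, not real data.
rows = [
    {"utm_source": "HA", "event": "Start Trial"},
    {"utm_source": "House Ads", "event": "start trial"},
    {"utm_source": "Google CPC", "event": "StartTrial"},
    {"utm_source": "paid Google", "event": "Start trial"},
    {"utm_source": "newsletter", "event": "Start Trial"},
]

print(count_house_ads_trials(rows))  # 4
```

An exact-match query for “HA” plus “Start Trial” would find only one of those four sign-ups. The alias table is the part that encodes business knowledge, which is exactly the piece a model does not have unless someone writes it down.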

As data analysts, we’re all pretty familiar with this problem. The capital S, the lowercase s, two synonyms for the same purchase event. We have the ingrained business knowledge to recognize that purchase_complete and complete_purchase are the same thing. We know the history. We know why it happened. We can figure it out.

The AI is not going to read that taxonomy the same way. It’s going to see two different events, and it’s not going to know it needs to combine them to get to the truth. It will give you an answer. It will give you a confident answer. And that answer will be wrong, because the underlying signal is fragmented and the model has no context to bridge the gap.

The numbers on this are pretty striking. A 2025 IBM Institute for Business Value report found that 43% of COOs now name data quality as their most significant data priority, and more than a quarter of organizations estimate they lose over $5 million annually because of it. BARC’s 2025 Trend Monitor, which surveyed nearly 1,600 data professionals, found that the share of organizations naming data quality as a top AI obstacle more than doubled in a single year, from 19% in 2024 to 44% in 2025.

That’s not a stable problem. That’s an accelerating one. Now add AI systems operating at scale on top of it.

When an analyst manually queries bad data, the damage is controllable. They might notice something looks off. They might flag it. They have context. When an AI agent queries bad data, it synthesizes that data and distributes its conclusions at scale, and nothing in that process stops to ask whether the underlying data was actually right.

This isn’t a reason to be pessimistic about AI in analytics. It’s a reason to get serious about data quality before AI becomes the primary interface to your data.

Governance Is the Answer. And It Has to Be Built In.

Here’s where I think the path forward gets clear: data governance isn’t just an operational nice-to-have anymore. It’s the prerequisite for AI to work at all. And it’s becoming one of the most important things an analytics professional can own.

For a long time, governance got treated as a cleanup project. Something you did after things broke, or once a quarter when someone finally had time. A spreadsheet was created, beautifully organized, with all the right event definitions and taxonomy decisions. And then, within weeks, nobody was updating it, nobody was referencing it, and the implementation itself kept evolving further and further away from that document.

That model doesn’t work. It never really did. But with AI reading your data and returning answers to your stakeholders at speed, the cost of ungoverned data just went up dramatically.

Real governance has to be built into the system, not maintained alongside it. It means clean event schemas that are defined once and persist consistently across your implementation. It means a taxonomy that doesn’t produce five synonyms for the same action because five developers made independent decisions at five different moments. It means descriptions and definitions that live with the data, not in a doc that nobody opens. And critically, it means a human in the loop: an analyst who knows the business context, understands the history of the data, and can provide that context when the system needs it.
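One way to picture “defined once, with definitions that live with the data” is a tracking plan that the collection code itself checks against. The Python below is a rough sketch under stated assumptions: EventSpec, TRACKING_PLAN, validate_event, and the event and property names are all hypothetical, and this is not any particular vendor’s schema format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventSpec:
    """One canonical event: its name, its meaning, and what it must carry."""
    name: str
    description: str
    required_properties: frozenset = frozenset()


# Hypothetical tracking plan. Each event is defined exactly once, and the
# description sits next to the definition rather than in a separate doc.
TRACKING_PLAN = {
    spec.name: spec
    for spec in [
        EventSpec(
            "start_trial",
            "User clicked any Start Trial button.",
            frozenset({"button_location"}),
        ),
        EventSpec(
            "purchase_complete",
            "Order was paid and confirmed.",
            frozenset({"order_id"}),
        ),
    ]
}


def validate_event(name: str, properties: dict) -> list:
    """Return a list of problems; an empty list means the event conforms."""
    spec = TRACKING_PLAN.get(name)
    if spec is None:
        return [f"unknown event name: {name!r}"]
    missing = sorted(spec.required_properties - set(properties))
    return [f"missing property: {prop}" for prop in missing]


print(validate_event("start_trial", {"button_location": "hero"}))
print(validate_event("Start Trial", {}))
print(validate_event("purchase_complete", {}))
```

The point of the design is that a fifth developer adding a fifth Start Trial button gets told at implementation time that “Start Trial” is not a known event, instead of an analyst discovering it months later.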

How Amplitude Is Approaching This

This is something I think about a lot in my role at Amplitude, and it’s something I believe the best analytics platforms are going to need to get right.

In Amplitude’s data section, governance is built into the product rather than bolted on after the fact. The platform actively surfaces issues in your taxonomy as they arise. If it detects that you have purchase_complete and complete_purchase being tracked as separate events, it will flag them: “We see you have these two events. Are they the same thing? If so, would you like us to merge them?”
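To illustrate the general idea of flagging likely duplicates, here is a simple heuristic sketch in Python. It is not Amplitude’s detection logic, which I’m not describing here. It assumes that two event names built from the same set of words are merge candidates, and the function names (token_key, find_merge_candidates) are hypothetical.

```python
import re
from collections import defaultdict


def token_key(name: str) -> frozenset:
    """Reduce an event name to its set of lowercase word tokens.

    camelCase boundaries are split first, so 'StartTrial' yields the
    same tokens as 'start_trial' or 'Start Trial'.
    """
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return frozenset(re.findall(r"[a-z0-9]+", spaced.lower()))


def find_merge_candidates(event_names):
    """Group names that share a token set; return groups of two or more."""
    groups = defaultdict(list)
    for name in event_names:
        groups[token_key(name)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Illustrative event names only.
names = [
    "purchase_complete",
    "complete_purchase",
    "Start Trial",
    "StartTrial",
    "start_trial",
    "page_view",
]

for group in find_merge_candidates(names):
    print("Possible duplicates, please confirm:", group)
```

A heuristic like this only proposes candidates. Word order sometimes carries meaning, so the decision to merge still belongs to the person who knows the history of the data.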

It identifies missing descriptions and tags across your events and properties. It offers AI-generated suggestions for what those definitions should say, based on how the events are actually being used. You can approve them, modify them, or write something entirely different.

That might sound like a small thing. It isn’t. Because those descriptions and definitions don’t just live in a catalog somewhere. They feed directly back into the model. When someone uses Amplitude’s AI features to ask a question about your data, the answers it returns are grounded in that governed, well-defined data layer. A taxonomy with clear, consistent event names and meaningful descriptions produces materially better AI outputs than one that’s fragmented and undocumented.

This is the direction I believe analytics tooling needs to move: governance as a continuous, embedded practice, not a one-time project. AI-assisted hygiene that helps you catch issues as they emerge, combined with a human analyst who has the business context to make the right calls when the system asks. The combination of both is what produces data you can actually trust.

And trust is the whole game now. Because if the data isn’t trustworthy, the AI answers built on top of it aren’t trustworthy either. And once you’ve lost stakeholder confidence in your analytics, it’s very hard to get back.

Where This Leaves the Analyst

So what does this mean for analytics professionals right now?

I think data governance is emerging as one of the two most important roles in the analytics space as AI continues to grow. It’s not the only role. But it’s the one that makes everything else possible.

Think of the analyst who understands how data moves from source to query, who can define and maintain a clean taxonomy, who knows why the events are named the way they are and what the edge cases mean. That person becomes more valuable as AI becomes more central to how organizations access their data. Not less.

In a follow-up post, I’ll dig into both of the emerging analyst roles I see becoming increasingly important: the data governance role and the strategist role. But for now, the short version is this: the analysts who invest in building trustworthy data systems will be the ones who make AI work. And the ones who make AI work will be indispensable.

