“Is this data any good?” — answered in one click
Before you build a report, train a model, or hand a dataset to a stakeholder, someone has to answer the most boring and most important question in data work: is this data actually any good?
That question used to take hours. You’d open a notebook, write a few queries, copy the output into a doc, write a verdict. By the time you finished, the data had often changed.
Exploratory Data Analysis (EDA) in VDF AI Data does it in one click. You pick an asset, hit Run analysis, and a few moments later you have a complete read on what’s healthy, what’s risky, and what’s quietly drifting.
Run EDA the moment you start trusting an asset. Not when something breaks. Five minutes of upfront EDA tells you what to expect — and turns "the data looks weird today" into "the data drifted in this column three weeks ago, and here's why."
Who this is for
EDA is built for people who need to make decisions about data, not just compute on it:
- Analysts sizing up a new dataset before building on top.
- Product and operations leads validating that a source is healthy enough to depend on.
- Workspace admins auditing what counts as a “golden source” and what doesn’t.
- Anyone joining a team who needs to understand a dataset they didn’t build.
You don’t need to write SQL. You don’t need to know statistics. The screens explain themselves.
What EDA gives you
An EDA run produces two things: a summary for the whole table and a profile for each column.
The table-level summary
A short scorecard that captures the shape of the dataset at a glance.
Missing values
What share of cells across the table are empty. Low is good. A sudden jump from one run to the next is a signal something changed upstream.
Duplicate rows
What share of rows are exact duplicates of another row. Often surprising — and often the first thing your downstream report needs to know about.
Outlier columns
Columns where the values look unusual relative to history — sudden spikes, unexpected nulls, value-set changes.
Class imbalance
For categorical columns, whether one value dominates so heavily it would distort downstream work. Common in fraud, churn, and rare-event data.
Overall quality score
A single rolled-up number you can sort and compare on. It uses the same scoring shown on the asset card after [discovery](/docs/products/vdf-ai-data/discovering-your-data).
Last run
When EDA was last run on this asset. Old analysis on a fast-moving table is a yellow flag of its own.
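For readers who want to see what the first few scorecard numbers mean in practice, here is a minimal Python sketch. It is not how VDF AI Data computes its summary; the function name `table_summary`, the treatment of `None` and empty strings as missing, and the choice to count every repeat after the first copy as a duplicate are all assumptions made for illustration.

```python
from collections import Counter

MISSING = (None, "")  # assumption: these two values count as "empty"

def table_summary(rows, columns):
    """Missing-cell share, duplicate-row share, and top-value share per column."""
    total_cells = len(rows) * len(columns)
    missing_cells = sum(
        1 for row in rows for col in columns if row.get(col) in MISSING
    )

    # Assumption: a row is a duplicate if an identical row appeared before it.
    row_counts = Counter(tuple(row.get(col) for col in columns) for row in rows)
    duplicate_rows = sum(count - 1 for count in row_counts.values())

    # Share held by the most frequent value: a rough class-imbalance check.
    dominant = {}
    for col in columns:
        values = [row.get(col) for row in rows if row.get(col) not in MISSING]
        if values:
            top_value, top_count = Counter(values).most_common(1)[0]
            dominant[col] = (top_value, top_count / len(values))

    return {
        "missing_share": missing_cells / total_cells if total_cells else 0.0,
        "duplicate_share": duplicate_rows / len(rows) if rows else 0.0,
        "dominant_value": dominant,
    }

rows = [
    {"id": 1, "status": "ok"},
    {"id": 2, "status": "ok"},
    {"id": 2, "status": "ok"},
    {"id": 3, "status": None},
]
summary = table_summary(rows, ["id", "status"])
print(summary["missing_share"], summary["duplicate_share"])
```

In this toy table one of eight cells is empty and one of four rows repeats an earlier row. The `status` column holds a single value, which is the kind of dominance the class imbalance check is meant to surface.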
The column profile
For each column, EDA produces a short profile.
| Field | What it tells you |
|---|---|
| Type | Numeric, datetime, categorical, or text — inferred from the actual values. |
| Missing % | What share of rows have no value in this column. |
| Unique % | What share of the column’s values are distinct. 100% means every row is unique (an ID or a timestamp). A value near 1% means a few values repeated many times (a status, a country). |
| Distribution summary | For numeric columns: min, max, mean, standard deviation. For categorical columns: the top values and their share. |
| Drift signal | A simple low / medium / high indicator: how much the column’s shape has shifted since the last EDA run. |
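The fields in this table can be approximated with a few lines of standard-library Python. This is a sketch under stated assumptions, not the product's profiler: `profile_column` is a hypothetical name, only numeric and categorical types are distinguished (datetime and text detection are left out), and Unique % is taken over non-missing values.

```python
import statistics
from collections import Counter

def profile_column(values, top_n=3):
    """Rough per-column profile: missing %, unique %, and a distribution summary."""
    present = [v for v in values if v is not None]
    profile = {
        "missing_pct": 100 * (len(values) - len(present)) / len(values) if values else 0.0,
        # Assumption: distinct values as a share of non-missing values.
        "unique_pct": 100 * len(set(present)) / len(present) if present else 0.0,
    }

    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        profile["type"] = "numeric"
        profile["min"] = min(present)
        profile["max"] = max(present)
        profile["mean"] = statistics.fmean(present)
        # Sample standard deviation needs at least two values.
        profile["stdev"] = statistics.stdev(present) if len(present) > 1 else 0.0
    else:
        profile["type"] = "categorical"
        profile["top_values"] = [
            (value, count / len(present))
            for value, count in Counter(present).most_common(top_n)
        ]
    return profile

print(profile_column([10, 20, 30, None]))
print(profile_column(["US", "US", "DE", "FR"]))
```

The first call describes a numeric column with one missing value in four; the second describes a categorical column where `US` holds half the rows.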
Reading a drift signal
Drift is the most frequently misunderstood concept in EDA. It’s also the most useful.
Drift means: “the shape of this column has changed in a way that’s worth noticing.” Not necessarily worth panicking. Just worth noticing.
A few examples that make it concrete:
- A `country` column that used to be 60% US starts trending toward 40%: a marketing campaign is landing somewhere new.
- A `latency_ms` column where the average doubled overnight: a deployment broke an upstream dependency.
- A `subscription_status` column where “cancelled” is rising: there’s a churn problem brewing.
VDF AI Data summarizes drift in three buckets:
| Signal | What it means | What to do |
|---|---|---|
| Low | The column looks like itself | No action needed |
| Medium | The shape has moved, not dramatically | Open the column profile; check if it tracks a known event |
| High | The shape has changed substantively | Investigate before depending on this column |
Drift isn't always bad. Sometimes drift is exactly what you wanted to see — a campaign worked, a fix shipped, a market expanded. EDA tells you something changed. You decide whether that's good news or bad.
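One way to picture a low / medium / high signal for a categorical column is to compare the value shares from two runs and bucket the distance. The sketch below uses total variation distance; the thresholds `0.10` and `0.25` and the function names are hypothetical, and nothing here is confirmed to match how VDF AI Data scores drift.

```python
from collections import Counter

def value_shares(values):
    """Map each distinct value to its share of the column."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def drift_signal(previous, current, medium_at=0.10, high_at=0.25):
    """Bucket the total variation distance between two runs of one column.

    The thresholds are illustrative assumptions, not product settings.
    """
    prev_shares = value_shares(previous)
    curr_shares = value_shares(current)
    categories = set(prev_shares) | set(curr_shares)
    # Total variation distance: half the sum of absolute share differences.
    distance = 0.5 * sum(
        abs(prev_shares.get(c, 0.0) - curr_shares.get(c, 0.0)) for c in categories
    )
    if distance >= high_at:
        return "high", distance
    if distance >= medium_at:
        return "medium", distance
    return "low", distance

# The country example: US share falls from 60% to 40% between runs.
last_run = ["US"] * 60 + ["DE"] * 40
this_run = ["US"] * 40 + ["DE"] * 60
print(drift_signal(last_run, this_run))
```

With these assumed thresholds, a 20-point shift in `country` lands in the medium bucket: worth opening the column profile, not yet a reason to stop depending on the column.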
How to read an EDA report
Three quick patterns:
When the score is high
If the summary score is high and no column is flagged high-drift, you can confidently build on this asset. Spot-check the columns you’ll depend on most — distributions, unique counts — and you’re done.
When the score is mid-range
Open the column profile and look for one of two patterns: missing-value clusters (a chunk of columns where missingness suddenly rose) or drift clusters (multiple columns shifting in the same way). Either pattern usually points to an upstream change you’ll want to understand before depending on this asset.
When the score is low
Don’t assume the data is broken; assume it’s misunderstood. Talk to the owner. The most common cause of a low score on a new connection isn’t bad data. It’s that the discovered scope included a few staging tables, a table that’s expected to be sparse, or an experimental dataset that hasn’t been cleaned yet. Excluding a handful of such assets often resolves the apparent quality problem at the source.
When to re-run EDA
EDA is cheap. Run it when something changes — and run it on a cadence for the assets that matter most.
- On a new asset. Always. EDA is your first read.
- After an upstream change. A new column, a renamed table, a refreshed load.
- Before a major decision. Building a new report, training a new model, handing a dataset to an external team.
- On a calendar for golden sources. Weekly or monthly EDA on the dozen assets your team relies on most.
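The triggers above reduce to a small rule that a team could encode in its own tooling. The sketch below is purely illustrative: `needs_rerun` and its parameters are hypothetical and do not correspond to any VDF AI Data setting.

```python
from datetime import datetime, timedelta

def needs_rerun(last_run, now, max_age, upstream_changed=False):
    """Return True when an asset is due for a fresh EDA run.

    last_run is None for a new asset that has never been analyzed.
    max_age is the cadence chosen for the asset, e.g. weekly for golden sources.
    """
    if upstream_changed or last_run is None:
        return True
    return (now - last_run) > max_age

now = datetime(2024, 1, 15)
weekly = timedelta(days=7)
print(needs_rerun(None, now, weekly))                   # new asset
print(needs_rerun(datetime(2024, 1, 10), now, weekly))  # analyzed 5 days ago
print(needs_rerun(datetime(2024, 1, 1), now, weekly))   # analyzed 14 days ago
```

The "before a major decision" trigger is left out on purpose, since that is a judgment call rather than something a timestamp can detect.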
What EDA does not do (and what to use instead)
EDA is great at telling you about the shape of the data. It’s not great at telling you:
- Whether the data is correct. A column can be high-quality, low-missing, and structurally pristine — and still record the wrong value. Correctness is a domain question, not a statistical one.
- Why drift happened. EDA flags drift. Investigating the cause means looking upstream — at deployments, schema changes, business events.
- What to build with the data. EDA tells you what you have. Deciding what to do with it is the next step — see Features and relationships.
A short EDA habit that pays off
A pattern that works on most teams:
- On every new connection, run EDA on the top 5 assets.
- On every Monday morning, glance at the EDA dashboard for your team’s golden sources. Drift signals jump out fast.
- On every quarterly review, run a fresh EDA pass on your full asset list — and update tags or comments for anything that moved.
Ten minutes, three habits, and your team always knows what’s healthy.
Where to go next
- Features and relationships — turn what you found into something the team can use.
- Discovering your data — the step before EDA: surfacing what you have.
- Vector indexes and semantic search — make text columns searchable once you trust them.
- Connecting databases — for the moment you need a new source.