Unifies your scattered data into one source of truth. Upgrades your existing models, dashboards, and queries into a causal semantic layer you didn't have to write. Picks up on trends and surfaces business insights, all wrapped in a quality harness that puts guardrails on the AI so the reports it generates stay on-spec.
Built for ClickHouse and BigQuery first. Snowflake · Databricks · others: WIP, contributors welcome ↗
New · LLM Wiki semantic layer
Your task board is already a semantic layer. dqt extracts it.
Dump tickets, SQL, and BI reports into raw/. Point Claude Code at the vault — it synthesises dataset descriptions, metric definitions, and causal edges into wiki/. No manual YAML authoring.
Based on Karpathy's LLM Wiki pattern ↗
The hour after the alert
You set a threshold. It fires. Slack lights up. Now you're bouncing between dbt docs, the warehouse, and your BI tool — trying to figure out which upstream model changed, whether the spike in nulls explains the dashboard regression, and whether this is worth waking the on-call engineer for.
dqt was built for the part that comes after the alert. It reads your dbt manifest, parses your warehouse SQL into a column-level lineage graph, runs 35 statistical detectors and 29 declarative checks, and discovers causal relationships across your metrics — so the next time something moves, you already know what moved it.
Without dqt
Now what? Go dig through git log, dbt docs, warehouse history…
With dqt
Lineage: stg_payments → orders → revenue. Schema break in stg_payments 6h ago.
Causal candidate: stg_payments → orders.amount (E-value 3.2, pending human review).
Four layers. One library.
Statistical detectors
MAD, double-MAD, isolation forest, KS, STL residual z-scores, adjusted boxplot fences. Plus completeness, validity, freshness, schema-change, and SQL-assertion checks. Every detector returns the same (verdict, score, plain_english) shape.
mad_outlier_fraction · ks_pvalue · stl_residual_zscore · isolation_forest_fraction
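To make the shared result shape concrete, here is a minimal standard-library sketch of a MAD-based outlier fraction. It is not dqt's implementation: the `DetectorResult` type, the 3.5 cutoff, and the warn/fail thresholds are assumptions chosen for illustration.

```python
from statistics import median
from typing import NamedTuple


class DetectorResult(NamedTuple):
    """Hypothetical type mirroring the (verdict, score, plain_english) shape."""
    verdict: str
    score: float
    plain_english: str


def mad_outlier_fraction(values, cutoff=3.5, warn=0.01, fail=0.05):
    """Fraction of values whose modified z-score exceeds `cutoff`.

    cutoff, warn and fail are illustrative defaults, not dqt's.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Degenerate case: at least half the values equal the median,
        # so treat anything different from it as an outlier.
        outliers = sum(1 for v in values if v != med)
    else:
        # 0.6745 rescales the MAD so the score is comparable to a
        # standard z-score when the data is roughly normal.
        outliers = sum(
            1 for v in values if 0.6745 * abs(v - med) / mad > cutoff
        )
    score = outliers / len(values)
    if score >= fail:
        verdict = "fail"
    elif score >= warn:
        verdict = "warn"
    else:
        verdict = "pass"
    summary = f"{score:.2%} of values are outliers ({verdict})"
    return DetectorResult(verdict, score, summary)


amounts = [100, 102, 98, 101, 99, 103, 97, 100, 101, 5000]
result = mad_outlier_fraction(amounts)
print(result.verdict, result.score, result.plain_english)
```

Because every detector returns the same three fields, downstream code can branch on `verdict` without knowing which algorithm produced it.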
Column-level lineage
dqt walks your dbt manifest and warehouse DDL with sqlglot to build a column-level dependency graph. From any incident, get an automatic blast radius — every downstream table and metric, ranked by exposure.
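The blast-radius idea can be sketched as a breadth-first walk over a column-level graph. The edges below are hypothetical and hand-written (dqt derives them from the manifest and DDL), and hop count stands in for whatever exposure ranking dqt actually uses.

```python
from collections import deque

# Hypothetical column-level edges: upstream column -> direct downstream columns.
LINEAGE = {
    "stg_payments.amount": ["orders.amount"],
    "orders.amount": ["revenue.total_usd", "revenue.avg_order_value"],
    "revenue.total_usd": ["dashboard.mrr"],
}


def blast_radius(graph, start):
    """Return {downstream_column: hops from start} using breadth-first search."""
    distance = {}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        for child in graph.get(node, []):
            # Skip columns already reached by a shorter path.
            if child not in distance and child != start:
                distance[child] = hops + 1
                queue.append((child, hops + 1))
    return distance


impacted = blast_radius(LINEAGE, "stg_payments.amount")
for column, hops in sorted(impacted.items(), key=lambda kv: kv[1]):
    print(f"{hops} hop(s): {column}")
```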
LLM Wiki · Semantic layer
dqt uses Karpathy's LLM Wiki pattern. Dump your Trello tickets, SQL files, and BI reports into raw/. Point Claude Code at the vault. It synthesises wiki/ — dataset descriptions, metric definitions, causal edges — from the artifacts your team already has. YAML contracts compatible with dbt's semantic_models.yml.
raw/tickets/ · raw/sql/ · raw/reports/ → wiki/metrics/ · wiki/lineage/
Causal discovery
dqt runs causal discovery across your metric time series, prunes edges with stability selection, and proposes directed metric→metric relationships annotated with lag, confidence, and E-values. Every edge is reviewed by a human before it enters the production DAG.
The only data questioning tool that ships causal discovery.
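As a rough illustration of the lag annotation only, the sketch below scans candidate lags and keeps the one with the strongest correlation between two series. This is a stand-in under stated assumptions: it is not dqt's discovery algorithm, it does no stability selection, and correlation at a lag is not evidence of causation. The series and function names are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def best_lag(cause, effect, max_lag):
    """Return (lag, correlation) where cause[t] best tracks effect[t + lag]."""
    scores = {}
    for lag in range(max_lag + 1):
        # Align cause[t] with effect[t + lag] by trimming both ends.
        x = cause[: len(cause) - lag]
        y = effect[lag:]
        scores[lag] = pearson(x, y)
    lag = max(scores, key=lambda k: abs(scores[k]))
    return lag, scores[lag]


payments = [5, 7, 6, 9, 4, 8, 10, 3, 7, 6, 9, 5]
# Synthetic effect: orders follow payments two steps later, scaled and offset.
orders = [0, 0] + [2 * p + 1 for p in payments[:-2]]

lag, corr = best_lag(payments, orders, max_lag=4)
print(f"candidate edge payments -> orders at lag {lag}, correlation {corr:.2f}")
```

A candidate like this is exactly the kind of edge that should stay pending until a human has reviewed it.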
Every BI request your GTM team filed is a semantic definition waiting to be extracted. The ticket says what the metric means. The SQL says how it's computed. The report says what thresholds matter.
dqt uses Karpathy's LLM Wiki structure: raw/ for atomic source documents, wiki/ for synthesised knowledge. Point Claude Code at the vault and it writes the semantic layer for you — from the artifacts your team already has.
Read the full workflow guide →
Export Trello tickets + attachments
SQL files, report HTMLs, metric definitions
Put them in raw/
raw/tickets/ · raw/sql/ · raw/reports/ · raw/schema/
Point Claude Code at the vault
cd vault && claude .
Claude Code synthesises wiki/
datasets, metrics, lineage, causal edges — grounded in your actual data
dqt generates per-column docs + checks
write_vault() · dqt run checks.yaml
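If you want to set up the folder layout by hand before exporting anything, a minimal sketch follows. The `scaffold_vault` helper is hypothetical and not part of dqt; it only creates the raw/ subfolders named above plus an empty wiki/ for Claude Code to fill.

```python
from pathlib import Path
import tempfile

# Subfolders of raw/ as listed in the steps above.
RAW_DIRS = ["tickets", "sql", "reports", "schema"]


def scaffold_vault(root):
    """Create raw/<subdir> folders and an empty wiki/ under `root`."""
    root = Path(root)
    for name in RAW_DIRS:
        (root / "raw" / name).mkdir(parents=True, exist_ok=True)
    (root / "wiki").mkdir(parents=True, exist_ok=True)
    # Return the created folders as sorted, root-relative POSIX paths.
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir()
    )


# A temporary directory keeps the example side-effect free.
vault = Path(tempfile.mkdtemp()) / "vault"
created = scaffold_vault(vault)
print(created)
```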
Recommended workflow
Three plugins turn Claude Code into a grounded data questioning engineer that knows your warehouse, knows the dqt API, and can run checks from natural language.
Connects Claude Code to dqt's live documentation and source — all 64 detector and check slugs, the exact YAML schema, and adapter protocol. No training-data lag.
Gives Claude Code structured workflows for planning, executing, and reviewing multi-step tasks — essential for building out a full dqt check suite or semantic layer from scratch.
Each warehouse publishes its own MCP — e.g. mcp-clickhouse for ClickHouse, or the Postgres MCP server. Connect Claude Code to your warehouse and it can inspect live schemas, sample real distributions, and write dqt checks grounded in your actual data.
you
Look at my orders table and write dqt checks for amount_usd. Flag outliers and distribution shift.
claude code
Querying your warehouse via MCP...
orders.amount_usd: right-skewed, p99=$4,820.
Context7 confirms: use adjusted_boxplot_fraction (handles skew via medcouple) + ks_pvalue for drift.
checks:
  - table_name: orders
    column_name: amount_usd
    detector_slug: adjusted_boxplot_fraction
  - table_name: orders
    column_name: amount_usd
    detector_slug: ks_pvalue
    params: {alpha: 0.01}
Full detector catalog
Three distinct problem domains: point outliers, distribution drift, and time-series anomalies. For any given column you typically need one or two — auto_outlier_fraction picks the right one automatically based on the data's distribution. Note: zscore_outlier_fraction assumes normality — use MAD or double-MAD on real warehouse data.
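The normality caveat is easy to see on a small skewed sample. In this standard-library sketch (illustrative cutoffs, not dqt's detectors), two large values inflate the standard deviation enough that the z-score flags nothing, while the MAD-based score still catches both.

```python
from statistics import mean, median, pstdev


def zscore_flags(values, cutoff=3.0):
    """Flag values more than `cutoff` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [abs(v - mu) / sigma > cutoff for v in values]


def mad_flags(values, cutoff=3.5):
    """Flag values whose modified z-score (median and MAD based) exceeds `cutoff`."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [0.6745 * abs(v - med) / mad > cutoff for v in values]


# Mostly small order amounts plus two very large ones.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 400, 450]

z_hits = sum(zscore_flags(amounts))
mad_hits = sum(mad_flags(amounts))
print(f"z-score flags {z_hits} values, MAD flags {mad_hits}")
```

The outliers themselves drag the mean and standard deviation toward them, which is why median-based scores hold up better on long-tailed warehouse columns.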
Statistical & ML algorithms · 35
Declarative checks · 29
Three lines to your first check.
from dqt import Check, Runner, MemoryStore
# `adapter` is a warehouse adapter instance you have already constructed.
check = Check(
    schema_name="public",
    table_name="orders",
    column_name="amount",
    detector_slug="mad_outlier_fraction",
)
result = Runner(MemoryStore()).run(check, adapter)
print(result.plain_english)
# → "0.82% of values are outliers — within the 1% warn threshold"
No server required. The optional FastAPI service and dashboard are there when you want them, and stay out of the way when you don't.
From zero to first incident.
Four steps. No database, no server. Runs in a notebook or a CI job — wherever Python runs.
Install
pip install dqtlib
Run your first check
from dqt import Runner, MemoryStore
from dqt.checks.models import Check
from dqt.adapters.local import LocalAdapter
import pandas as pd
df = pd.read_csv("orders.csv")
store = MemoryStore()
check = Check(
    schema_name="public", table_name="orders",
    column_name="amount_usd",
    detector_slug="wasserstein_1",  # drift detection
)
result = Runner(store).run_in_memory(
    check,
    reference=df[df.date < "2024-01-01"],
    current=df[df.date >= "2024-01-01"],
)
print(result.verdict, result.plain_english)
Read the result
verdict: pass · warn · fail (threshold decision)
score: 0.3142 (raw metric, Wasserstein distance)
plain_english: "Distance 0.31 — above warn threshold" (human-readable summary)
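To show where a score like this comes from, here is a small sketch of the 1-D Wasserstein-1 distance for two equal-size samples, followed by a threshold decision. The thresholds and the `to_verdict` helper are assumptions for illustration; dqt's own defaults and handling of unequal sample sizes are not shown here.

```python
def wasserstein_1(reference, current):
    """1-D Wasserstein-1 distance for two samples of equal size.

    With equal sizes it reduces to the mean absolute difference
    between the sorted samples.
    """
    if len(reference) != len(current):
        raise ValueError("this sketch only handles equal-size samples")
    pairs = zip(sorted(reference), sorted(current))
    return sum(abs(r - c) for r, c in pairs) / len(reference)


def to_verdict(score, warn, fail):
    """Map a raw score to pass / warn / fail (hypothetical thresholds)."""
    if score >= fail:
        return "fail"
    if score >= warn:
        return "warn"
    return "pass"


reference = [10.0, 12.0, 11.0, 13.0]
current = [10.5, 12.5, 11.0, 13.25]

score = wasserstein_1(reference, current)
verdict = to_verdict(score, warn=0.25, fail=0.5)
print(verdict, score)
```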
Open the dashboard
pip install "dqtlib[dashboard]"   # adds FastAPI + uvicorn
dqt dashboard --port 8080         # → http://127.0.0.1:8080
Checks, column distribution profiles, and Granger causality inference — all in one place. No signup, no cloud, no persistent state beyond the process.
Drop it in next to the tools you already use.
Open source · MIT licensed · Python 3.12+ · No telemetry · No signup · No credit card
About the author
Anton Barr is an engineer and data geek with 25+ years building data systems. A student of 質 (shitsu): quality, substance, the inner nature of a thing. dqt is a personal project built by a practitioner who believes craft and precision are the same thing, and who got tired of tools that answer what but never why.