Deep Research OS — the pipeline
Most "AI research" is one model summarising its training data. This is the opposite: a multi-model pipeline where every claim cites a URL fetched at runtime, every number is computed in a Python sandbox, every section survives a adversarial review before merge, and the whole thing renders as a static HTML bundle a CTO can audit in a browser.
1 · The problem
Three failure modes dominate AI-generated research:
- Single-model hallucination. One model invents a number, polishes it into a confident sentence, and the deck moves on. The reader cannot distinguish a sourced figure from a generated one.
- LLM arithmetic. Even when the source is real, the model rounds, transposes, or invents during multiplication. Compounded across a financial model, the output is unverifiable.
- "Better UX" as moat. Without an explicit defensibility taxonomy, every roadmap reduces to a list of features. There is no test for whether the value compounds.
The output looks plausible, falls apart on a CTO's first probe, and nothing in the build pipeline could have caught it.
2 · The thesis
Treat research as a build pipeline. Apply software-engineering discipline to the artefact: schemas, fail-fast gates, immutable canonical state, isolated calculation, multi-perspective code review, regression checks. The model is a worker; the pipeline is the product.
Every error becomes a rule. When a sub-agent fabricates a fact, fails verification, or drifts into wish-based language, a regression check is added to the pipeline. No retries without a new rule.
3 · Pipeline architecture
A reasoning sandwich. Strategy and adversarial review use the most expensive model; deep search is parallelised across three different RAG strategies; structured extraction is delegated to the cheapest model that passes a schema check; calculation never touches an LLM.
Reconnaissance
Model: Claude Opus 4.7 with extended thinking. Output: a research blueprint that decomposes the prompt into atomic jobs with model assignments per job. Failure modes caught here: scope creep, missing falsifiability, fabricated segments.
Parallel deep search
Workers: Perplexity Sonar Deep Research, Alibaba Tongyi DeepResearch 30B, OpenAI o4-mini deep-research — all three queried with the same questions. Output: citations with quotes, archived to evidence/<run-id>/_research_arms/. Cross-arm disagreement is preserved (never averaged) and surfaced to synthesis.
Structured-extraction swarm
Model: Claude Haiku 4.5, parallel sub-agents. Output: JSON/YAML matching strict schemas — peer cards, job stories, value mechanics, capability map, geo bands. Unknown values are null with a row in data gaps; fabrication of a number is a pipeline failure.
Python sandbox
Every numeric claim in the final brief is the output of executed Python. The financial model writes a financial_calc.py file, runs it via subprocess, and the result is the only number that lands in the output JSON. The flag calculation_method == "python_subprocess_executed" is a hard gate.
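A minimal sketch of that contract, assuming the generated script prints its numbers as JSON on stdout (the helper name and result layout are illustrative, not the project's literal code):

```python
import json
import subprocess
import sys

def run_financial_model(calc_path: str) -> dict:
    """Execute the generated calc file; its stdout is the only source of numbers."""
    proc = subprocess.run(
        [sys.executable, calc_path],
        capture_output=True, text=True, timeout=60, check=True,
    )
    result = json.loads(proc.stdout)  # the generated script prints JSON
    result["calculation_method"] = "python_subprocess_executed"
    return result

numbers = run_financial_model("financial_calc.py")
assert numbers["calculation_method"] == "python_subprocess_executed"  # hard gate
```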
Adversarial review
Model: Claude Opus 4.7 in a hostile-investor frame. Output: minimum 3 HIGH-severity issues per run. If the adversarial review finds fewer, the agent is re-invoked with stricter instructions — being too diplomatic is itself a failure.
Canonical brief
Model: Claude Opus 4.7. Output: canonical_brief.json — the immutable downstream label set. From this point, no agent (deck assembler, translator, renderer) may use a number that contradicts the canonical brief.
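A sketch of what that contract can look like downstream, assuming the brief keeps its numbers under a figures key (the key and helper name are hypothetical):

```python
import json

with open("canonical_brief.json") as f:
    canonical = json.load(f)

def assert_canonical(label: str, value: str) -> str:
    """Refuse to render any figure that contradicts the canonical brief."""
    expected = str(canonical["figures"][label])
    if value != expected:
        raise ValueError(f"{label}: renderer wanted {value!r}, canonical has {expected!r}")
    return value
```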
Static HTML
Mobile-first vanilla HTML+CSS, no framework, no build step. Output: reports/final/*.html. Playwright audit harness verifies metrics on iPhone 13, Pixel 7, iPhone SE, iPad Mini, desktop 1280 before the bundle is considered green.
4 · Model routing
Picking the right model for each job is the cost-quality lever. The default is Claude Haiku 4.5 — Claude Opus 4.7 is the exception, used only where the work is irreplaceable.
| Stage | Model | Reason |
|---|---|---|
| Plan / strategy | Claude Opus 4.7 (extended thinking) | Highest-leverage cognitive moment; mistakes here cascade. |
| Web fan-out (×3) | Sonar Deep Research; Tongyi DeepResearch; o4-mini deep-research | Three different RAG strategies on the same questions; cross-check surfaces disagreement. |
| Structured extraction | Claude Haiku 4.5 (parallel) | Schema-bound work doesn't need frontier reasoning; Claude Haiku 4.5 is 90% of the quality at ⅓ the cost. |
| Synthesis / framing | Claude Opus 4.7 (extended thinking) | Cross-document insight generation requires the best model. |
| Adversarial review | Claude Opus 4.7 (extended thinking, max budget) | Adversarial reasoning is the second-most expensive cognitive task; do not skimp. |
| Translation, render | Claude Haiku 4.5 | Schema-driven work with a pre-built glossary. |
| Free fallback | NVIDIA Nemotron series (free tier) | First-pass scans only; never used for final synthesis. |
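In code, the routing table reduces to a stage-to-model map with a cheap default. A sketch; the model slugs are illustrative stand-ins, not the project's literal configuration:

```python
# Stage -> model id. Cheap default, expensive exception.
ROUTING: dict[str, str] = {
    "plan": "anthropic/claude-opus",            # extended thinking per-call
    "synthesis": "anthropic/claude-opus",
    "adversarial_review": "anthropic/claude-opus",
    "extract": "anthropic/claude-haiku",        # schema-bound parallel workers
    "translate": "anthropic/claude-haiku",
    "render": "anthropic/claude-haiku",
}

def model_for(stage: str) -> str:
    # Unknown stages fall back to the cheap default, never to Opus.
    return ROUTING.get(stage, "anthropic/claude-haiku")
```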
5 · Quality gates
The pipeline runs two layers of gates. Pipeline-phase gates check each research stage as it produces JSON; content-and-SEO gates check the published HTML before deploy. Failure becomes a new rule — the gate file at scripts/check_quality.py is the live regression log.
Pipeline-phase gates (research output)
- Completeness ≥ 60%, content type set, target geographies non-empty. Fail action: surface clarification questions to the user.
- Every numeric claim has a _source URL; the URL was fetched and returned HTTP 200; the response body is archived. Fail action: re-invoke the research agent with the missing-field list.
- calculation_method == "python_subprocess_executed"; pessimistic < base < optimistic; revenue percentages sum to 1.0 ± 0.001 (sketched after this list). Fail action: halt the pipeline, require explicit confirmation.
- At least three HIGH-severity issues identified. Below threshold means the critic is being too diplomatic; re-invoke with stricter framing.
- An entry in regulatory_compliance for every priority geography. Fail action: re-invoke the risk assessor with the missing-geographies list.
- Every section heading non-null; no placeholder strings; financial figures reference canonical brief values verbatim.
- All output files present and non-empty; HTML files contain the expected number of sections; URLs resolve to HTTP 200.
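The financial-consistency gate above, sketched as plain Python. Field names are assumptions; the real checks live in scripts/check_quality.py:

```python
def check_financial_consistency(model: dict) -> list[str]:
    """Return a list of gate violations; non-empty means halt the pipeline."""
    errors: list[str] = []
    if model.get("calculation_method") != "python_subprocess_executed":
        errors.append("numbers did not come from an executed Python subprocess")
    p, b, o = model["pessimistic"], model["base"], model["optimistic"]
    if not p < b < o:
        errors.append(f"scenario ordering violated: {p} < {b} < {o} is false")
    total = sum(model["revenue_percentages"])
    if abs(total - 1.0) > 0.001:
        errors.append(f"revenue percentages sum to {total:.4f}, not 1.0 ± 0.001")
    return errors
```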
Content + SEO gates (published HTML)
Fourteen gates run on every page in reports/final/ via scripts/check_quality.py. The full set runs locally with make audit and on every pull request via .github/workflows/quality-gates.yml.
- Exactly one <h1> per page (sketched after this list).
- Block timestamped run identifiers in published prose.
- Block job-application framing leaking into research voice.
- Block any reference to the deleted JD coverage page.
- Block "stop test", "exit criteria", "adversarial review" jargon variants in user-facing prose.
- Block specific unverified marketing badge phrases.
- Block literal <https://...> markdown auto-link text.
- Block internal codes in prose.
- Required <head> tags present (title, canonical, OG, Twitter, robots).
- JSON-LD structured data present and parseable.
- Every <img> has alt + width + height (CLS prevention).
- Every internal href resolves to an existing file.
- Every page carries the same nav structure.
- Site-wide files (robots.txt, llms.txt, sitemap.xml, etc.) present and valid.
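As an example of the shape these gates take, a sketch of the exactly-one-<h1> check; the function name is hypothetical and the real implementation lives in scripts/check_quality.py:

```python
import re
from pathlib import Path

def check_single_h1(page: Path) -> str | None:
    """Return an error string if the page does not have exactly one <h1>."""
    html = page.read_text(encoding="utf-8")
    count = len(re.findall(r"<h1[\s>]", html))
    return None if count == 1 else f"{page.name}: expected one <h1>, found {count}"

failures = [msg for p in Path("reports/final").glob("*.html")
            if (msg := check_single_h1(p))]
```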
The gate suite is unit-tested at tests/test_check_quality.py (51 tests, ~72% coverage on the gate modules). When a new failure is found in production, the failure becomes a new gate. No retry without a new rule.
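A hypothetical pair of unit tests in the style of tests/test_check_quality.py, exercising the h1 sketch above:

```python
def test_single_h1_rejects_duplicate_headings(tmp_path):
    page = tmp_path / "page.html"
    page.write_text("<h1>A</h1><h1>B</h1>", encoding="utf-8")
    assert check_single_h1(page) is not None

def test_single_h1_accepts_one_heading(tmp_path):
    page = tmp_path / "page.html"
    page.write_text("<h1>A</h1>", encoding="utf-8")
    assert check_single_h1(page) is None
```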
6 · The evidence chain
Every claim in the strategic brief traces back to an artefact in evidence/<run-id>/. The Evidence Map renders the trace human-readable. Three discipline rules govern the chain:
- Citation requirement. Every numeric claim in any research output has a _source URL field. No URL → the claim is dropped.
- Quote ≤ 15 words. Translations are stored separately; the original verbatim is preserved. Mistranslation is detectable.
- Null beats fabrication. Any agent that cannot find real data writes null and adds an entry to data gaps. A null with documentation is valid; a fabricated number is a pipeline failure.
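A worked example of a record that obeys all three rules; the field names are illustrative, modelled on the _source convention above:

```python
peer_card = {
    "name": "Example Competitor",
    "arr_usd": None,  # not found -> null, never a guess
    "pricing_usd_month": 29.0,
    "pricing_usd_month_source": "https://example.com/pricing",  # fetched, HTTP 200, archived
    "quote": "Pro plan: $29 per month",  # verbatim, 15 words or fewer
}
data_gaps = [{"field": "arr_usd", "reason": "no public filing or credible estimate"}]
```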
The whole pipeline state lives in flat JSON/YAML. diff works. git blame works. There is no opaque vector store or graph database to consult.
7 · Agent specialisation
Six specialist agents in the project's .claude/agents/ directory, each with a narrow contract. Routing decisions and effort budgets are documented in QUALITY_BAR.md.
- Runs parallel fan-out across Sonar Deep Research, Sonar Pro, Tongyi DeepResearch, and o4-mini. Cross-validates free arms with paid models.
- Extracts the peer set from research arms, deduplicates, tags discovery method and verification timestamp.
- Surfaces at least three HIGH-severity issues plus at least five MEDIUM-severity. Includes fabrication-risk detection and survivorship-bias check. Runs on Claude Opus.
- Writes the immutable canonical brief; respects the cross-document consistency contract.
- Constructs persona archetypes from peer cards and segment data.
- Renders the final HTML with mobile-first design, JSON-LD, and Playwright-audited layout.
Five new skills (AI-search optimisation)
The session that produced this rewrite added five reusable skills (loaded by Claude Code from the user's ~/.claude/skills/ directory) and two universal rule files:
- Generative Engine Optimisation — make content cited by ChatGPT Search, Perplexity, Gemini.
- AI Overview Optimisation — land in Google AI Overviews. Direct-answer paragraphs, query fan-out, quarterly freshness.
- Answer Engine Optimisation — voice search, featured snippets, "People also ask".
- LLM Optimisation — crawler access, llms.txt spec, brand-entity reinforcement, the AI-crawler reference table.
- Schema.org reference — the five JSON-LD types covering 80% of cases plus universal head-tag and JSON-LD templates.
Universal rule files ai-search-optimization.md and content-quality-gates.md in ~/.claude/rules/common/.
The full research note backing those skills lives at evidence/research/ai-search-optimization/SUMMARY.md: 4,696 words, 67 unique sources, primary docs anchoring the load-bearing claims (Princeton GEO paper, llmstxt.org, OpenAI bot docs, Anthropic privacy page, Schema.org type pages).
8 · Frontend engineering
The output is the audit. A static HTML bundle a reviewer can open in a browser, in private mode, with DevTools open. No build step, no framework, no external resources.
Mobile-first design system
- Fluid typography via clamp() — no breakpoint cliffs.
- Mobile (< 720px): floating action button bottom-right opens a bottom-sheet drawer driven by <details> — works without JavaScript.
- Tablet / desktop (≥ 720px): sticky always-visible side rail, never scrolls out of view.
- WCAG 2.5.5 — 44 px tap targets on standalone controls; the audit script distinguishes inline-prose links (exempt) from standalone CTAs (required).
- Dark mode, safe-area insets (env()), prefers-reduced-motion, print stylesheet.
Audit harness — Playwright
The harness in scripts/audit_mobile.mjs emulates 5 device profiles (iPhone 13, Pixel 7, iPhone SE, iPad Mini, desktop 1280) and captures: TTFB, FCP, LCP, CLS, transfer bytes, document height, horizontal overflow, console errors, network failures, viewport screenshots, full-page screenshots, and a tap-target audit that distinguishes inline-prose vs standalone controls.
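A sketch of the device loop using Playwright's Python sync API rather than the project's actual .mjs harness; the local URL is a placeholder:

```python
from playwright.sync_api import sync_playwright

PROFILES = ["iPhone 13", "Pixel 7", "iPhone SE", "iPad Mini"]

with sync_playwright() as p:
    browser = p.chromium.launch()
    for name in PROFILES:
        context = browser.new_context(**p.devices[name])  # viewport, UA, touch
        page = context.new_page()
        page.goto("http://localhost:8000/index.html")  # placeholder URL
        overflow = page.evaluate(
            "document.documentElement.scrollWidth > document.documentElement.clientWidth"
        )
        assert not overflow, f"{name}: horizontal overflow"
        page.screenshot(path=f"audit_{name.replace(' ', '_')}.png", full_page=True)
        context.close()
    browser.close()
```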
9 · Multi-perspective audit
Before publication the bundle was audited by simulated specialists running in parallel — QA, ML, Data Science, SWE, DevOps. Each surfaces a different class of failure:
- QA — reproducibility, gate coverage, regression-rule additions after each failure.
- ML — hallucination prevention, confidence scoring, model-routing justification.
- Data Science — number traceability, source tier separation, null discipline.
- SWE — code idiom, semantic HTML, accessibility.
- DevOps — deployment readiness, security headers, secrets handling, .gitignore hygiene, smoke-checks.
Audit reports live alongside the run artefacts in evidence/_audits/. The discipline is the same as the adversarial review: an audit that says "everything passes" is itself a failure.
10 · AI search optimisation
The site is engineered to be cited by AI search surfaces (ChatGPT Search, Perplexity, Google AI Overviews, Claude). The four 2026 paradigms — Generative Engine Optimisation (GEO), AI Overview Optimisation (AIO), Answer Engine Optimisation (AEO), and LLM Optimisation (LLMO) — are summarised in the cross-cutting reference at evidence/research/ai-search-optimization/SUMMARY.md.
Per-page
- A forty-to-sixty-word citable summary lead under the H1.
- JSON-LD @graph with Article + Organization + WebSite + BreadcrumbList + Person (sketched after this list).
- Open Graph + Twitter Card meta tags.
- Canonical URL, language attribute, robots directive permitting full snippet reuse.
- Mobile-first hamburger drawer that expands to a horizontal row at ≥ 880 px (WCAG 2.5.5 — 44 px tap targets).
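A hypothetical builder for that @graph block; the organisation name and values are placeholders for whatever the render stage actually emits:

```python
import json

def jsonld_graph(title: str, url: str, author: str) -> str:
    """Serialise the five-type @graph into a <script> tag for the page head."""
    graph = {
        "@context": "https://schema.org",
        "@graph": [
            {"@type": "Article", "headline": title, "mainEntityOfPage": url,
             "author": {"@type": "Person", "name": author}},
            {"@type": "Organization", "name": "Example Org"},   # placeholder
            {"@type": "WebSite", "url": url},
            {"@type": "BreadcrumbList", "itemListElement": [
                {"@type": "ListItem", "position": 1, "name": title, "item": url},
            ]},
        ],
    }
    return f'<script type="application/ld+json">{json.dumps(graph)}</script>'
```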
Site-wide assets
Generated deterministically by scripts/build_seo_assets.py:
- robots.txt — AI-crawler-aware allow / disallow per the 2026 reference. Allows GPTBot, OAI-SearchBot, ChatGPT-User, the three Anthropic bots, PerplexityBot, Google-Extended, Applebot-Extended; blocks Bytespider and the deprecated Anthropic agents.
- llms.txt — concise machine-readable index per the llmstxt.org spec: H1, blockquote summary, H2-grouped link list with one-line annotations.
- llms-full.txt — full plain-text concatenation of every page's body. AI agents visit it ~2× more than llms.txt (Semrush 2026).
- sitemap.xml — standard XML sitemap with lastmod from file mtime, priority, change frequency (sketched after this list).
- RSS 2.0 feed of every report page.
- security.txt — RFC 9116 contact + policy file under /.well-known/.
- Web app manifest — minimal PWA manifest setting the brand colour, icon, language, and category.
- humans.txt — human-readable team and credit file.
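A sketch of the sitemap step in the spirit of scripts/build_seo_assets.py; the function shape is an assumption, though lastmod genuinely derives from file mtime as described:

```python
from datetime import datetime, timezone
from pathlib import Path

def build_sitemap(site_root: str, out_dir: Path = Path("reports/final")) -> str:
    """Deterministic sitemap: sorted pages, lastmod from file mtime."""
    urls = []
    for page in sorted(out_dir.glob("*.html")):
        lastmod = datetime.fromtimestamp(page.stat().st_mtime, tz=timezone.utc)
        urls.append(
            f"  <url><loc>{site_root}/{page.name}</loc>"
            f"<lastmod>{lastmod.date().isoformat()}</lastmod>"
            f"<changefreq>monthly</changefreq><priority>0.8</priority></url>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "\n".join(urls)
        + "\n</urlset>"
    )
```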
11 · Repository & engineering quality
The full source is at github.com/avaluev/padel-market-analysis — every script, every prompt, every test, every CI workflow. Apache 2.0.
Layout
- scripts/ — pipeline builders, content sanitiser, SEO asset generator, quality gate, mobile audit.
- tests/ — pytest suite for the new gates (51 tests).
- reports/final/ — the published site (this directory is what GitHub Pages serves).
- evidence/ — raw research output, audit reports, the AI-search research summary.
- QUALITY_BAR.md — the rules every agent obeys.
- .github/workflows/ — CI: deploy + quality-gates.
Engineering quality
- Typed Python — mypy --strict on scripts/ and tests/; configured in pyproject.toml.
- Linted + formatted — ruff for both lint and format; runs as a pre-commit hook.
- Tested — 51 unit tests, ~72% coverage on the gate modules; tests required for every new gate.
- CI gated — quality-gates.yml blocks merge to main on lint, typecheck, test, content-quality, link-verify, secret-scan failures.
- Secret scanning — gitleaks as both a pre-commit hook and a CI job.
- Reproducible builds — Makefile with make install, make test, make audit, make build; idempotency check in CI.
- Dependency hygiene — Dependabot weekly runs on Python, npm, GitHub Actions.
Local audit
Reproduce the full pre-deploy gate set in one command:
git clone https://github.com/avaluev/padel-market-analysis.git
cd padel-market-analysis
make install # python dev deps + node deps
make audit # build + content + SEO gates
12 · What this scales to
The pipeline is domain-agnostic. The Padel AI Coach run is a worked example, not the product. Replace the input prompt and the same architecture produces:
- An investor-grade brief on any market opportunity.
- A competitive landscape with verified peer cards and moat analysis.
- A capability map for any team — what ships solo, what needs a partner, what needs capital.
- A risk register with regulatory coverage by geography and named pivot triggers.
Where to put it next: continuous research (run on a cron, diff the canonical brief, alert on material changes); enterprise customers (replace OpenRouter with self-hosted endpoints, swap the evidence store for Postgres+pgvector if the run cardinality demands it); decision support (treat the canonical brief as a structured input to product-roadmap and capital-allocation decisions).
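The continuous-research idea reduces to diffing two canonical briefs. A sketch, assuming the brief keeps numeric figures under a single key; the threshold and layout are assumptions:

```python
import json
from pathlib import Path

def material_changes(old_path: str, new_path: str, threshold: float = 0.10) -> list[str]:
    """Flag figures that moved more than the threshold between two runs."""
    old = json.loads(Path(old_path).read_text())["figures"]
    new = json.loads(Path(new_path).read_text())["figures"]
    changes = []
    for key in sorted(old.keys() & new.keys()):
        a, b = old[key], new[key]
        if (isinstance(a, (int, float)) and isinstance(b, (int, float))
                and a != 0 and abs(b - a) / abs(a) > threshold):
            changes.append(f"{key}: {a} -> {b}")
    changes += [f"{k}: added" for k in sorted(new.keys() - old.keys())]
    changes += [f"{k}: removed" for k in sorted(old.keys() - new.keys())]
    return changes  # non-empty -> alert
```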