Deep Research OS — the pipeline
Most "AI research" is one model summarising its training data. This is the opposite: a multi-model pipeline where every claim cites a URL fetched at runtime, every number is computed in a Python sandbox, every section survives a adversarial review before merge, and the whole thing renders as a static HTML bundle a CTO can audit in a browser.
1 · The problem
Three failure modes dominate AI-generated research:
- Single-model hallucination. One model invents a number, polishes it into a confident sentence, and the deck moves on. The reader cannot distinguish a sourced figure from a generated one.
- LLM arithmetic. Even when the source is real, the model rounds, transposes, or invents during multiplication. Compounded across a financial model, the output is unverifiable.
- "Better UX" as moat. Without an explicit defensibility taxonomy, every roadmap reduces to a list of features. There is no test for whether the value compounds.
The output looks plausible, falls apart on a CTO's first probe, and nothing in the build pipeline could have caught it.
2 · The thesis
Treat research as a build pipeline. Apply software-engineering discipline to the artefact: schemas, fail-fast gates, immutable canonical state, isolated calculation, multi-perspective code review, regression checks. The model is a worker; the pipeline is the product.
Every error becomes a rule. When a sub-agent fabricates a fact, fails verification, or drifts into wish-based language, a regression check is added to the pipeline. No retries without a new rule.
3 · Pipeline architecture
A reasoning sandwich. Strategy and adversarial review use the most expensive model; deep search is parallelised across three different RAG strategies; structured extraction is delegated to the cheapest model that passes a schema check; calculation never touches an LLM.
Reconnaissance
Model: Claude Opus 4.7 with extended thinking. Output: a research blueprint that decomposes the prompt into atomic jobs with model assignments per job. Failure modes caught here: scope creep, missing falsifiability, fabricated segments.
Parallel deep search
Workers: Perplexity Sonar Deep Research, Alibaba Tongyi DeepResearch 30B, OpenAI o4-mini deep-research — all three queried with the same questions. Output: citations with quotes, archived to evidence/<run-id>/_research_arms/. Cross-arm disagreement is preserved (never averaged) and surfaced to synthesis.
Structured-extraction swarm
Model: Claude Haiku 4.5, parallel sub-agents. Output: JSON/YAML matching strict schemas — peer cards, job stories, value mechanics, capability map, geo bands. Unknown values are null with a row in data gaps; fabrication of a number is a pipeline failure.
Python sandbox
Every numeric claim in the final brief is the output of executed Python. The financial model writes a financial_calc.py file, runs it via subprocess, and the result is the only number that lands in the output JSON. The flag calculation_method == "python_subprocess_executed" is a hard gate.
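A minimal sketch of that contract, assuming the generated script prints its numbers as JSON on stdout (the helper name and result layout are illustrative, not the project's literal code):

```python
import json
import subprocess
import sys

def run_financial_model(calc_path: str) -> dict:
    """Execute the generated calc file; its stdout is the only source of numbers."""
    proc = subprocess.run(
        [sys.executable, calc_path],
        capture_output=True, text=True, timeout=60, check=True,
    )
    result = json.loads(proc.stdout)  # the generated script prints JSON
    result["calculation_method"] = "python_subprocess_executed"
    return result

numbers = run_financial_model("financial_calc.py")
assert numbers["calculation_method"] == "python_subprocess_executed"  # hard gate
```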
Adversarial review
Model: Claude Opus 4.7 in a hostile-investor frame. Output: minimum 3 HIGH-severity issues per run. If the adversarial review finds fewer, the agent is re-invoked with stricter instructions — being too diplomatic is itself a failure.
Canonical brief
Model: Claude Opus 4.7. Output: canonical_brief.json — the immutable downstream label set. From this point, no agent (deck assembler, translator, renderer) may use a number that contradicts the canonical brief.
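A sketch of what that contract can look like downstream, assuming the brief keeps its numbers under a figures key (the key and helper name are hypothetical):

```python
import json

with open("canonical_brief.json") as f:
    canonical = json.load(f)

def assert_canonical(label: str, value: str) -> str:
    """Refuse to render any figure that contradicts the canonical brief."""
    expected = str(canonical["figures"][label])
    if value != expected:
        raise ValueError(f"{label}: renderer wanted {value!r}, canonical has {expected!r}")
    return value
```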
Static HTML
Mobile-first vanilla HTML+CSS, no framework, no build step. Output: reports/final/*.html. Playwright audit harness verifies metrics on iPhone 13, Pixel 7, iPhone SE, iPad Mini, desktop 1280 before the bundle is considered green.
4 · Model routing
Picking the right model for each job is the cost-quality lever. The default is Claude Haiku 4.5 — Claude Opus 4.7 is the exception, used only where the work is irreplaceable.
| Stage | Model | Reason |
|---|---|---|
| Plan / strategy | Claude Opus 4.7 (extended thinking) | Highest-leverage cognitive moment; mistakes here cascade. |
| Web fan-out (×3) | Sonar Deep Research; Tongyi DeepResearch; o4-mini deep-research | Three different RAG strategies on the same questions; cross-check surfaces disagreement. |
| Structured extraction | Claude Haiku 4.5 (parallel) | Schema-bound work doesn't need frontier reasoning; Claude Haiku 4.5 is 90% of the quality at ⅓ the cost. |
| Synthesis / framing | Claude Opus 4.7 (extended thinking) | Cross-document insight generation requires the best model. |
| Adversarial review | Claude Opus 4.7 (extended thinking, max budget) | Adversarial reasoning is the second-most expensive cognitive task; do not skimp. |
| Translation, render | Claude Haiku 4.5 | Schema-driven work with a pre-built glossary. |
| Free fallback | NVIDIA Nemotron series (free tier) | First-pass scans only; never used for final synthesis. |
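In code, the routing table reduces to a stage-to-model map with a cheap default. A sketch; the model slugs are illustrative stand-ins, not the project's literal configuration:

```python
# Stage -> model id. Cheap default, expensive exception.
ROUTING: dict[str, str] = {
    "plan": "anthropic/claude-opus",            # extended thinking per-call
    "synthesis": "anthropic/claude-opus",
    "adversarial_review": "anthropic/claude-opus",
    "extract": "anthropic/claude-haiku",        # schema-bound parallel workers
    "translate": "anthropic/claude-haiku",
    "render": "anthropic/claude-haiku",
}

def model_for(stage: str) -> str:
    # Unknown stages fall back to the cheap default, never to Opus.
    return ROUTING.get(stage, "anthropic/claude-haiku")
```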
5 · Quality gates
The pipeline runs two layers of gates. Pipeline-phase gates check each research stage as it produces JSON; content-and-SEO gates check the published HTML before deploy. Failure becomes a new rule — the gate file at scripts/check_quality.py is the live regression log.
Pipeline-phase gates (research output)
- Completeness ≥ 60%, content type set, target geographies non-empty. Fail action: surface clarification questions to the user.
- Every numeric claim has a _source URL; the URL was fetched and returned HTTP 200; the response body is archived. Fail action: re-invoke the research agent with the missing-field list.
- calculation_method == "python_subprocess_executed"; pessimistic < base < optimistic; revenue percentages sum to 1.0 ± 0.001 (sketched after this list). Fail action: halt the pipeline, require explicit confirmation.
- At least three HIGH-severity issues identified. Below threshold means the critic is being too diplomatic; re-invoke with stricter framing.
- An entry in regulatory_compliance for every priority geography. Fail action: re-invoke the risk assessor with the missing-geographies list.
- Every section heading non-null; no placeholder strings; financial figures reference canonical brief values verbatim.
- All output files present and non-empty; HTML files contain the expected number of sections; URLs resolve to HTTP 200.
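The financial-consistency gate above, sketched as plain Python. Field names are assumptions; the real checks live in scripts/check_quality.py:

```python
def check_financial_consistency(model: dict) -> list[str]:
    """Return a list of gate violations; non-empty means halt the pipeline."""
    errors: list[str] = []
    if model.get("calculation_method") != "python_subprocess_executed":
        errors.append("numbers did not come from an executed Python subprocess")
    p, b, o = model["pessimistic"], model["base"], model["optimistic"]
    if not p < b < o:
        errors.append(f"scenario ordering violated: {p} < {b} < {o} is false")
    total = sum(model["revenue_percentages"])
    if abs(total - 1.0) > 0.001:
        errors.append(f"revenue percentages sum to {total:.4f}, not 1.0 ± 0.001")
    return errors
```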
Content + SEO gates (published HTML)
Fourteen gates run on every page in reports/final/ via scripts/check_quality.py. The full set runs locally with make audit and on every pull request via .github/workflows/quality-gates.yml.
- Exactly one <h1> per page (sketched after this list).
- Block timestamped run identifiers in published prose.
- Block job-application framing leaking into research voice.
- Block any reference to the deleted JD coverage page.
- Block "stop test", "exit criteria", "adversarial review" jargon variants in user-facing prose.
- Block specific unverified marketing badge phrases.
- Block literal <https://...> markdown auto-link text.
- Block internal codes in prose.
- Required <head> tags present (title, canonical, OG, Twitter, robots).
- JSON-LD structured data present and parseable.
- Every <img> has alt + width + height (CLS prevention).
- Every internal href resolves to an existing file.
- Every page carries the same nav structure.
- Site-wide files (robots.txt, llms.txt, sitemap.xml, etc.) present and valid.
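As an example of the shape these gates take, a sketch of the exactly-one-<h1> check; the function name is hypothetical and the real implementation lives in scripts/check_quality.py:

```python
import re
from pathlib import Path

def check_single_h1(page: Path) -> str | None:
    """Return an error string if the page does not have exactly one <h1>."""
    html = page.read_text(encoding="utf-8")
    count = len(re.findall(r"<h1[\s>]", html))
    return None if count == 1 else f"{page.name}: expected one <h1>, found {count}"

failures = [msg for p in Path("reports/final").glob("*.html")
            if (msg := check_single_h1(p))]
```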
The gate suite is unit-tested at tests/test_check_quality.py (51 tests, ~72% coverage on the gate modules). When a new failure is found in production, the failure becomes a new gate. No retry without a new rule.
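A hypothetical pair of unit tests in the style of tests/test_check_quality.py, exercising the h1 sketch above:

```python
def test_single_h1_rejects_duplicate_headings(tmp_path):
    page = tmp_path / "page.html"
    page.write_text("<h1>A</h1><h1>B</h1>", encoding="utf-8")
    assert check_single_h1(page) is not None

def test_single_h1_accepts_one_heading(tmp_path):
    page = tmp_path / "page.html"
    page.write_text("<h1>A</h1>", encoding="utf-8")
    assert check_single_h1(page) is None
```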
6 · The evidence chain
Every claim in the strategic brief traces back to an artefact in evidence/<run-id>/. The Evidence Map renders the trace human-readable. Three discipline rules govern the chain:
- Citation requirement. Every numeric claim in any research output has a _source URL field. No URL → the claim is dropped.
- Quote ≤ 15 words. Translations are stored separately; the original verbatim is preserved. Mistranslation is detectable.
- Null beats fabrication. Any agent that cannot find real data writes null and adds an entry to data gaps. A null with documentation is valid; a fabricated number is a pipeline failure.
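A worked example of a record that obeys all three rules; the field names are illustrative, modelled on the _source convention above:

```python
peer_card = {
    "name": "Example Competitor",
    "arr_usd": None,  # not found -> null, never a guess
    "pricing_usd_month": 29.0,
    "pricing_usd_month_source": "https://example.com/pricing",  # fetched, HTTP 200, archived
    "quote": "Pro plan: $29 per month",  # verbatim, 15 words or fewer
}
data_gaps = [{"field": "arr_usd", "reason": "no public filing or credible estimate"}]
```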
The whole pipeline state lives in flat JSON/YAML. diff works. git blame works. There is no opaque vector store or graph database to consult.
7 · Agent specialisation
Six specialist agents in the project's .claude/agents/ directory, each with a narrow contract. Routing decisions and effort budgets are documented in QUALITY_BAR.md.
- Runs parallel fan-out across Sonar Deep Research, Sonar Pro, Tongyi DeepResearch, and o4-mini. Cross-validates free arms with paid models.
- Extracts the peer set from research arms, deduplicates, tags discovery method and verification timestamp.
- Surfaces at least three HIGH-severity issues plus at least five MEDIUM-severity. Includes fabrication-risk detection and survivorship-bias check. Runs on Claude Opus.
- Writes the immutable canonical brief; respects the cross-document consistency contract.
- Constructs persona archetypes from peer cards and segment data.
- Renders the final HTML with mobile-first design, JSON-LD, and Playwright-audited layout.
Five new skills (AI-search optimisation)
The session that produced this rewrite added five reusable skills (loaded by Claude Code from the user's ~/.claude/skills/ directory) and two universal rule files:
- Generative Engine Optimisation — make content cited by ChatGPT Search, Perplexity, Gemini.
- AI Overview Optimisation — land in Google AI Overviews. Direct-answer paragraphs, query fan-out, quarterly freshness.
- Answer Engine Optimisation — voice search, featured snippets, "People also ask".
- LLM Optimisation — crawler access, llms.txt spec, brand-entity reinforcement, the AI-crawler reference table.
- Schema.org reference — the five JSON-LD types covering 80% of cases plus universal head-tag and JSON-LD templates.
Universal rule files ai-search-optimization.md and content-quality-gates.md in ~/.claude/rules/common/.
The full research note backing those skills lives at evidence/research/ai-search-optimization/SUMMARY.md: 4,696 words, 67 unique sources, primary docs anchoring the load-bearing claims (Princeton GEO paper, llmstxt.org, OpenAI bot docs, Anthropic privacy page, Schema.org type pages).
8 · Frontend engineering
The output is the audit. A static HTML bundle a reviewer can open in a browser, in private mode, with DevTools open. No build step, no framework, no external resources.
Mobile-first design system
- Fluid typography via clamp() — no breakpoint cliffs.
- Mobile (< 720px): floating action button bottom-right opens a bottom-sheet drawer driven by <details> — works without JavaScript.
- Tablet / desktop (≥ 720px): sticky always-visible side rail, never scrolls out of view.
- WCAG 2.5.5 — 44 px tap targets on standalone controls; the audit script distinguishes inline-prose links (exempt) from standalone CTAs (required).
- Dark mode, safe-area insets (env()), prefers-reduced-motion, print stylesheet.
Audit harness — Playwright
The harness in scripts/audit_mobile.mjs emulates 5 device profiles (iPhone 13, Pixel 7, iPhone SE, iPad Mini, desktop 1280) and captures: TTFB, FCP, LCP, CLS, transfer bytes, document height, horizontal overflow, console errors, network failures, viewport screenshots, full-page screenshots, and a tap-target audit that distinguishes inline-prose vs standalone controls.
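A sketch of the device loop using Playwright's Python sync API rather than the project's actual .mjs harness; the local URL is a placeholder:

```python
from playwright.sync_api import sync_playwright

PROFILES = ["iPhone 13", "Pixel 7", "iPhone SE", "iPad Mini"]

with sync_playwright() as p:
    browser = p.chromium.launch()
    for name in PROFILES:
        context = browser.new_context(**p.devices[name])  # viewport, UA, touch
        page = context.new_page()
        page.goto("http://localhost:8000/index.html")  # placeholder URL
        overflow = page.evaluate(
            "document.documentElement.scrollWidth > document.documentElement.clientWidth"
        )
        assert not overflow, f"{name}: horizontal overflow"
        page.screenshot(path=f"audit_{name.replace(' ', '_')}.png", full_page=True)
        context.close()
    browser.close()
```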
9 · Multi-perspective audit
Before publication the bundle was audited by simulated specialists running in parallel — QA, ML, Data Science, SWE, DevOps. Each surfaces a different class of failure:
- QA — reproducibility, gate coverage, regression-rule additions after each failure.
- ML — hallucination prevention, confidence scoring, model-routing justification.
- Data Science — number traceability, source tier separation, null discipline.
- SWE — code idiom, semantic HTML, accessibility.
- DevOps — deployment readiness, security headers, secrets handling, .gitignore hygiene, smoke-checks.
Audit reports live alongside the run artefacts in evidence/_audits/. The discipline is the same as the adversarial review: an audit that says "everything passes" is itself a failure.
10 · AI search optimisation
The site is engineered to be cited by AI search surfaces (ChatGPT Search, Perplexity, Google AI Overviews, Claude). The four 2026 paradigms — Generative Engine Optimisation (GEO), AI Overview Optimisation (AIO), Answer Engine Optimisation (AEO), and LLM Optimisation (LLMO) — are summarised in the cross-cutting reference at evidence/research/ai-search-optimization/SUMMARY.md.
Per-page
- A forty-to-sixty-word citable summary lead under the H1.
- JSON-LD @graph with Article + Organization + WebSite + BreadcrumbList + Person (sketched after this list).
- Open Graph + Twitter Card meta tags.
- Canonical URL, language attribute, robots directive permitting full snippet reuse.
- Mobile-first hamburger drawer that expands to a horizontal row at ≥ 880 px (WCAG 2.5.5 — 44 px tap targets).
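A hypothetical builder for that @graph block; the organisation name and values are placeholders for whatever the render stage actually emits:

```python
import json

def jsonld_graph(title: str, url: str, author: str) -> str:
    """Serialise the five-type @graph into a <script> tag for the page head."""
    graph = {
        "@context": "https://schema.org",
        "@graph": [
            {"@type": "Article", "headline": title, "mainEntityOfPage": url,
             "author": {"@type": "Person", "name": author}},
            {"@type": "Organization", "name": "Example Org"},   # placeholder
            {"@type": "WebSite", "url": url},
            {"@type": "BreadcrumbList", "itemListElement": [
                {"@type": "ListItem", "position": 1, "name": title, "item": url},
            ]},
        ],
    }
    return f'<script type="application/ld+json">{json.dumps(graph)}</script>'
```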
Site-wide assets
Generated deterministically by scripts/build_seo_assets.py:
- robots.txt — AI-crawler-aware allow / disallow per the 2026 reference. Allows GPTBot, OAI-SearchBot, ChatGPT-User, the three Anthropic bots, PerplexityBot, Google-Extended, Applebot-Extended; blocks Bytespider and the deprecated Anthropic agents.
- llms.txt — concise machine-readable index per the llmstxt.org spec: H1, blockquote summary, H2-grouped link list with one-line annotations.
- llms-full.txt — full plain-text concatenation of every page's body. AI agents visit it ~2× more than llms.txt (Semrush 2026).
- sitemap.xml — standard XML sitemap with lastmod from file mtime, priority, change frequency (sketched after this list).
- RSS 2.0 feed of every report page.
- security.txt — RFC 9116 contact + policy file under /.well-known/.
- Web app manifest — minimal PWA manifest setting the brand colour, icon, language, and category.
- humans.txt — human-readable team and credit file.
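A sketch of the sitemap step in the spirit of scripts/build_seo_assets.py; the function shape is an assumption, though lastmod genuinely derives from file mtime as described:

```python
from datetime import datetime, timezone
from pathlib import Path

def build_sitemap(site_root: str, out_dir: Path = Path("reports/final")) -> str:
    """Deterministic sitemap: sorted pages, lastmod from file mtime."""
    urls = []
    for page in sorted(out_dir.glob("*.html")):
        lastmod = datetime.fromtimestamp(page.stat().st_mtime, tz=timezone.utc)
        urls.append(
            f"  <url><loc>{site_root}/{page.name}</loc>"
            f"<lastmod>{lastmod.date().isoformat()}</lastmod>"
            f"<changefreq>monthly</changefreq><priority>0.8</priority></url>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "\n".join(urls)
        + "\n</urlset>"
    )
```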
11 · Repository & engineering quality
The full source is at github.com/avaluev/padel-market-analysis — every script, every prompt, every test, every CI workflow. Apache 2.0.
Layout
- scripts/ — pipeline builders, content sanitiser, SEO asset generator, quality gate, mobile audit.
- tests/ — pytest suite for the new gates (51 tests).
- reports/final/ — the published site (this directory is what GitHub Pages serves).
- evidence/ — raw research output, audit reports, the AI-search research summary.
- QUALITY_BAR.md — the rules every agent obeys.
- .github/workflows/ — CI: deploy + quality-gates.
Engineering quality
- Typed Python — mypy --strict on scripts/ and tests/; configured in pyproject.toml.
- Linted + formatted — ruff for both lint and format; runs as a pre-commit hook.
- Tested — 51 unit tests, ~72% coverage on the gate modules; tests required for every new gate.
- CI gated — quality-gates.yml blocks merge to main on lint, typecheck, test, content-quality, link-verify, secret-scan failures.
- Secret scanning — gitleaks as both a pre-commit hook and a CI job.
- Reproducible builds — Makefile with make install, make test, make audit, make build; idempotency check in CI.
- Dependency hygiene — Dependabot weekly runs on Python, npm, GitHub Actions.
Local audit
Reproduce the full pre-deploy gate set in one command:
git clone https://github.com/avaluev/padel-market-analysis.git
cd padel-market-analysis
make install # python dev deps + node deps
make audit # build + content + SEO gates
12 · What this scales to
The pipeline is domain-agnostic. The Padel AI Coach run is a worked example, not the product. Replace the input prompt and the same architecture produces:
- An investor-grade brief on any market opportunity.
- A competitive landscape with verified peer cards and moat analysis.
- A capability map for any team — what ships solo, what needs a partner, what needs capital.
- A risk register with regulatory coverage by geography and named pivot triggers.
Where to put it next: continuous research (run on a cron, diff the canonical brief, alert on material changes); enterprise customers (replace OpenRouter with self-hosted endpoints, swap the evidence store for Postgres+pgvector if the run cardinality demands it); decision support (treat the canonical brief as a structured input to product-roadmap and capital-allocation decisions).
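The continuous-research idea reduces to diffing two canonical briefs. A sketch, assuming the brief keeps numeric figures under a single key; the threshold and layout are assumptions:

```python
import json
from pathlib import Path

def material_changes(old_path: str, new_path: str, threshold: float = 0.10) -> list[str]:
    """Flag figures that moved more than the threshold between two runs."""
    old = json.loads(Path(old_path).read_text())["figures"]
    new = json.loads(Path(new_path).read_text())["figures"]
    changes = []
    for key in sorted(old.keys() & new.keys()):
        a, b = old[key], new[key]
        if (isinstance(a, (int, float)) and isinstance(b, (int, float))
                and a != 0 and abs(b - a) / abs(a) > threshold):
            changes.append(f"{key}: {a} -> {b}")
    changes += [f"{k}: added" for k in sorted(new.keys() - old.keys())]
    changes += [f"{k}: removed" for k in sorted(old.keys() - new.keys())]
    return changes  # non-empty -> alert
```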