The human-governed autonomous ML lab
Proof Foundry helps teams run long-running machine learning research with autonomous agents, experiment ledgers, verifier gates, budgets, approvals, and durable evidence behind every model, metric, and claim.
Built on ML Junior. Designed for researchers, model builders, frontier startups, and serious open-source labs.
“Every claim must survive contact with evidence.”
Orchestrated for long-running research. Designed around evidence, budgets, approvals, and verifier gates.
Most AI agents generate output. Proof Foundry generates research you can trust.
Long-running ML work collapses when the only source of truth is a chat transcript. Experiments disappear into logs. Metrics lose provenance. Agents forget what they tried. Humans return hours later and cannot tell what is real.
Proof Foundry is built for the actual ML research loop: hypotheses, data, baselines, experiments, failures, artifacts, and verification.
The problem
- Chat history becomes state
- Metrics are detached from runs
- Failed experiments are forgotten
- Long projects become illegible
What research needs
- Durable workflow state
- Experiment ledgers
- Dataset/code snapshots
- Evidence-backed claims
- Human approval and budget controls
What Proof Foundry adds
- Flow templates
- Research agents
- Verifier gates
- Evidence ledger
- Lab dashboard
A new operating system for ML research
Proof Foundry combines autonomous agents, reproducible ML workflows, experiment ledgers, and human governance into one research control plane.
Research Programs
Organize long-running directions into missions, projects, and flow templates.
Autonomous ML Workers
Specialized agents read papers, inspect datasets, implement code, launch experiments, and prepare reports.
Experiment Ledger
Every run records code, data, config, metrics, logs, artifacts, cost, and verifier verdict.
Proof Gate
Final claims must link to evidence before they are accepted, published, or reported.
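To make the ledger idea concrete, here is a minimal sketch of what one experiment-ledger record could look like as plain data. All field names are illustrative assumptions, not Proof Foundry's actual schema:

```python
from dataclasses import dataclass

# Hypothetical sketch of one experiment-ledger record.
# Field names are illustrative, not the product's real schema.
@dataclass(frozen=True)
class RunRecord:
    run_id: str
    code_snapshot: str      # e.g. a git commit hash
    data_snapshot: str      # dataset revision identifier
    config: dict            # hyperparameters and launch settings
    metrics: dict           # parsed metric name -> value
    log_path: str
    artifacts: list
    cost_usd: float
    verifier_verdict: str   # "accepted" / "rejected" / "pending"

run = RunRecord(
    run_id="run-001",
    code_snapshot="a1b2c3d",
    data_snapshot="squad@rev-2024-01",
    config={"lr": 3e-5, "epochs": 3},
    metrics={"f1": 88.4},
    log_path="logs/run-001.txt",
    artifacts=["checkpoints/run-001"],
    cost_usd=12.50,
    verifier_verdict="pending",
)
```

The point of a record like this is that a metric never travels alone: it is pinned to the exact code, data, config, logs, and cost that produced it.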
Research Flow
Manifesto
Research is not chat.
It is a sequence of hypotheses, experiments, failures, and evidence.
Autonomy requires governance.
Powerful agents need budgets, policies, approvals, and oversight.
Metrics without provenance are noise.
Every result must link back to code, data, config, and logs.
Long-running work must remain legible.
A researcher should return after hours or days and instantly understand the state of a project.
Publishing is a gate, not a side effect.
Nothing important gets shipped without verification.
The lab is the product.
Not just the agent. The system around the agent.
From idea to verified result
Define the mission
Set objective, constraints, budget, and approval rules.
Launch a research program
Turn broad research goals into projects and workstreams.
Select a flow template
Use structured protocols for paper reproduction, fine-tuning, ablations, dataset audits, or benchmark creation.
Run experiments
Agents inspect data, write code, launch jobs, parse logs, and track metrics.
Track evidence
Every run links to dataset snapshots, code snapshots, configs, logs, artifacts, and costs.
Verify claims
The verifier checks metric provenance, baseline fairness, split integrity, and artifact availability.
Publish with confidence
Reports, model cards, and artifacts are released only after human approval.
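The verification step above can be sketched as a simple gate: a claim passes only if the run it cites exists in the ledger, carries a complete evidence chain, and actually recorded the claimed metric. This is an illustrative sketch with hypothetical names, not the product's real API:

```python
# Illustrative verifier-gate sketch: a claim is accepted only if its
# cited run exists and the evidence chain is complete. Hypothetical names.
def verify_claim(claim: dict, ledger: dict) -> bool:
    run = ledger.get(claim.get("run_id"))
    if run is None:
        return False  # metric has no provenance: reject
    required = ("code_snapshot", "data_snapshot", "config", "logs")
    if not all(run.get(key) for key in required):
        return False  # incomplete evidence chain: reject
    # the claimed value must match what the run actually recorded
    return run["metrics"].get(claim["metric"]) == claim["value"]

ledger = {"run-7": {
    "code_snapshot": "a1b2c3d",
    "data_snapshot": "glue@v1",
    "config": {"lr": 2e-5},
    "logs": "logs/run-7.txt",
    "metrics": {"accuracy": 91.2},
}}

claim = {"run_id": "run-7", "metric": "accuracy", "value": 91.2}
```

A claim citing a missing run, or a run with gaps in its evidence, is rejected before it can become a published number.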
Watch a lab come alive
A single request becomes a governed research workflow: agents are assigned, phases appear, experiments run, metrics are verified, and final claims are backed by proof.
"Reproduce this paper and see if we can improve the baseline by 2%."
Lab created
Proof Foundry creates a research project, selects the Reproduce Paper flow, and assigns specialized roles.
The paper becomes a protocol
The Literature Scientist extracts the method, target metrics, datasets, and reproducibility risks.
The flow becomes visible
Long-running work is no longer hidden in chat. Every phase has status, artifacts, and blockers.
Experiments move through the board
ML Junior writes code, launches runs, records logs, parses metrics, and tracks costs.
Weak results are rejected
The Eval Auditor blocks invalid results instead of letting them become impressive-looking claims.
Claims receive proof
Every final claim links to the exact run, dataset, code, config, logs, and verifier verdict.
Publish requires approval
The lab can prepare artifacts autonomously, but publishing remains a human-governed decision.
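The walkthrough above can be sketched as a flow template: an ordered set of phases, each owned by a role, with explicit gates where autonomy stops. Phase names, roles, and the gate field are illustrative assumptions, not the real template format:

```python
# Hypothetical "Reproduce Paper" flow template as plain data.
# Phase and gate names are illustrative, not the product's schema.
REPRODUCE_PAPER_FLOW = {
    "name": "reproduce-paper",
    "phases": [
        {"name": "extract-protocol", "role": "Literature Scientist"},
        {"name": "prepare-data",     "role": "ML Junior"},
        {"name": "run-baselines",    "role": "ML Junior"},
        {"name": "run-experiments",  "role": "ML Junior"},
        {"name": "audit-results",    "role": "Eval Auditor", "gate": "verifier"},
        {"name": "publish",          "role": "Human Board",  "gate": "approval"},
    ],
}

def gated_phases(flow: dict) -> list:
    """Return the phases that cannot complete without a gate passing."""
    return [p["name"] for p in flow["phases"] if "gate" in p]
```

Encoding the flow as data is what makes long-running work legible: every phase has an owner, a status, and, where it matters, a gate a human or verifier must pass.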
Not a chatbot. A governed research system.
Built for machine learning, not generic automation
Generic agents are optimized for coding tasks. Proof Foundry is optimized for ML research workflows where the hard part is not just generating code, but proving what happened.
The lab stack
Proof Foundry is built as a research control plane around ML Junior. The system separates orchestration, execution, evidence, and verification.
Human Board
Approves budgets, compute, publishing, metrics, and final verified claims.
Proof Foundry Control Plane
Orchestrates research programs, flow templates, and agent coordination.
Research Programs + Flow Templates
Structured protocols for reproducible ML research workflows.
ML Junior Runtime
Executes Hugging Face-native ML work: papers, datasets, code, jobs, metrics, model cards.
Evidence Ledger
Tracks hypotheses, snapshots, runs, logs, metrics, artifacts, and costs.
Verifier Gate
Blocks unsupported claims and prevents accidental publishing.
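One way the control plane's budget governance could work is a guard that refuses to launch a run once the mission's spend would exceed its budget. This is a minimal sketch under that assumption, with a hypothetical API:

```python
# Minimal budget-guard sketch: launching a run is blocked once the
# mission's projected spend would exceed its budget. Hypothetical API.
class BudgetGuard:
    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def can_launch(self, estimated_cost_usd: float) -> bool:
        """True if the estimated run still fits within the budget."""
        return self.spent_usd + estimated_cost_usd <= self.budget_usd

    def record(self, cost_usd: float) -> None:
        """Record the actual cost of a completed run."""
        self.spent_usd += cost_usd

guard = BudgetGuard(budget_usd=100.0)
guard.record(90.0)
```

With 90 of 100 dollars spent, a 5-dollar run can launch and a 20-dollar run cannot; raising the ceiling is a decision that belongs to the Human Board, not the agent.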
For the people building serious ML systems
Indie researchers
Run ambitious research programs without losing context or reproducibility.
Frontier AI startups
Coordinate model experiments, benchmarks, and artifacts with governance from day one.
Applied ML teams
Turn repeated workflows into templates and keep experiments auditable.
Open-source model builders
Publish better model cards, reproducibility reports, and evidence-backed releases.
Research labs
Use autonomous agents while preserving human oversight, budgets, and verification.
Join the founding waitlist
Proof Foundry is in development. We are inviting serious researchers, ML engineers, model builders, and frontier teams to shape the first human-governed autonomous ML lab.