Proof Foundry / Autonomous ML Lab

The human-governed autonomous ML lab

Proof Foundry helps teams conduct long-running machine learning research through autonomous agents, experiment ledgers, verifier gates, budgets, approvals, and durable evidence for every model, metric, and claim.

Built on ML Junior. Designed for researchers, model builders, frontier startups, and serious open-source labs.

“Every claim must survive contact with evidence.”

Orchestrated for long-running research. Designed around evidence, budgets, approvals, and verifier gates.

ML Junior runtime · Hugging Face-native workflows · Paperclip-style governance · Experiment ledger · Human approval gates

Most AI agents generate output. Proof Foundry generates research you can trust.

Long-running ML work collapses when the only source of truth is a chat transcript. Experiments disappear into logs. Metrics lose provenance. Agents forget what they tried. Humans return hours later and cannot tell what is real.

Proof Foundry is built for the actual ML research loop: hypotheses, data, baselines, experiments, failures, artifacts, and verification.

The problem

  • Chat history becomes state
  • Metrics are detached from runs
  • Failed experiments are forgotten
  • Long projects become illegible

What research needs

  • Durable workflow state
  • Experiment ledgers
  • Dataset/code snapshots
  • Evidence-backed claims
  • Human approval and budget controls

What Proof Foundry adds

  • Flow templates
  • Research agents
  • Verifier gates
  • Evidence ledger
  • Lab dashboard
System Architecture

A new operating system for ML research

Proof Foundry combines autonomous agents, reproducible ML workflows, experiment ledgers, and human governance into one research control plane.

Research Programs

Organize long-running directions into missions, projects, and flow templates.

Autonomous ML Workers

Specialized agents read papers, inspect datasets, implement code, launch experiments, and prepare reports.

Experiment Ledger

Every run records code, data, config, metrics, logs, artifacts, cost, and verifier verdict.

Proof Gate

Final claims must link to evidence before they are accepted, published, or reported.
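As a sketch of what one Experiment Ledger entry described above might contain. This is illustrative only: the field names and types here are assumptions, not Proof Foundry's actual schema.

```python
from dataclasses import dataclass

# Hypothetical shape of one Experiment Ledger entry; every field name
# below is an assumption drawn from the prose, not a real API.
@dataclass
class ExperimentRun:
    run_id: str
    code_snapshot: str       # e.g. a git commit hash
    data_snapshot: str       # dataset revision or content hash
    config: dict             # hyperparameters and launch settings
    metrics: dict            # parsed results, e.g. {"f1": 0.912}
    log_uri: str             # where the raw logs live
    artifacts: list          # produced files (checkpoints, reports)
    cost_usd: float          # compute spend attributed to this run
    verifier_verdict: str = "pending"   # "passed" / "rejected" / "pending"

run = ExperimentRun(
    run_id="exp_014",
    code_snapshot="a1b2c3d",
    data_snapshot="glue@2024.1",
    config={"lr": 2e-5, "epochs": 3},
    metrics={"f1": 0.912},
    log_uri="s3://lab/logs/exp_014",
    artifacts=["model.safetensors", "report.md"],
    cost_usd=4.20,
)
print(run.run_id, run.verifier_verdict)  # → exp_014 pending
```

A run starts with a "pending" verdict; only the verifier flips it, which is what lets a claim later point at a single trustworthy record.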

Research Flow

Mission
Research Program
Project
Flow Template
Experiment Run
Evidence Ledger
Verifier Gate
Publish
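The flow above is a strictly ordered pipeline. A minimal sketch, assuming each stage must complete before the next begins (the phase names come from the diagram; the state machine is our assumption):

```python
# Ordered phases of the research flow, as named in the diagram above.
PHASES = [
    "Mission",
    "Research Program",
    "Project",
    "Flow Template",
    "Experiment Run",
    "Evidence Ledger",
    "Verifier Gate",
    "Publish",
]

def next_phase(current: str):
    """Return the phase that follows `current`, or None after Publish."""
    i = PHASES.index(current)
    return PHASES[i + 1] if i + 1 < len(PHASES) else None

print(next_phase("Verifier Gate"))  # → Publish
```

Note that "Publish" is reachable only through "Verifier Gate", which is the point of the design: there is no path to publication that skips verification.
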
Our Principles

Manifesto

1

Research is not chat.

It is a sequence of hypotheses, experiments, failures, and evidence.

2

Autonomy requires governance.

Powerful agents need budgets, policies, approvals, and oversight.

3

Metrics without provenance are noise.

Every result must link back to code, data, config, and logs.

4

Long-running work must remain legible.

A researcher should return after hours or days and instantly understand the state of a project.

5

Publishing is a gate, not a side effect.

Nothing important gets shipped without verification.

6

The lab is the product.

Not just the agent. The system around the agent.

Process

From idea to verified result

1

Define the mission

Set objective, constraints, budget, and approval rules.

2

Launch a research program

Turn broad research goals into projects and workstreams.

3

Select a flow template

Use structured protocols for paper reproduction, fine-tuning, ablations, dataset audits, or benchmark creation.

4

Run experiments

Agents inspect data, write code, launch jobs, parse logs, and track metrics.

5

Track evidence

Every run links to dataset snapshots, code snapshots, configs, logs, artifacts, and costs.

6

Verify claims

The verifier checks metric provenance, baseline fairness, split integrity, and artifact availability.

7

Publish with confidence

Reports, model cards, and artifacts are released only after human approval.
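Step 6 above names four checks: metric provenance, baseline fairness, split integrity, and artifact availability. A hedged sketch of how such a gate could work; the record shape, rules, and return convention here are all assumptions, not Proof Foundry's implementation:

```python
# Illustrative Proof Gate check. The dict keys and pass/fail rules are
# hypothetical; only the four check categories come from the text.
def verify_claim(run: dict):
    problems = []
    # Metric provenance: every reported metric must trace back to logs.
    if not run.get("log_uri"):
        problems.append("metric provenance: no logs linked")
    # Baseline fairness: baseline must use the same evaluation config.
    if run.get("baseline_config") != run.get("eval_config"):
        problems.append("baseline fairness: configs differ")
    # Split integrity: train and test sets must not overlap.
    if run.get("train_test_overlap", 0) > 0:
        problems.append("split integrity: train/test overlap detected")
    # Artifact availability: claimed artifacts must actually exist.
    if not run.get("artifacts"):
        problems.append("artifact availability: no artifacts attached")
    return (len(problems) == 0, problems)

ok, why = verify_claim({
    "log_uri": "s3://lab/logs/exp_014",
    "baseline_config": {"split": "test"},
    "eval_config": {"split": "test"},
    "train_test_overlap": 0,
    "artifacts": ["model.safetensors"],
})
print(ok)  # → True for this run
```

Returning the list of problems, not just a boolean, matters: a rejected run tells the researcher which check failed, as in the "Split integrity failed" rejection shown in the demo below.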

Demo

Watch a lab come alive

A single request becomes a governed research workflow: agents are assigned, phases appear, experiments run, metrics are verified, and final claims are backed by proof.

User prompt

"Reproduce this paper and see if we can improve the baseline by 2%."

01/07

Lab created

Proof Foundry creates a research project, selects the Reproduce Paper flow, and assigns specialized roles.

Research Program
02/07

The paper becomes a protocol

The Literature Scientist extracts the method, target metrics, datasets, and reproducibility risks.

Method Extraction
Target: F1 score
Dataset: GLUE
03/07

The flow becomes visible

Long-running work is no longer hidden in chat. Every phase has status, artifacts, and blockers.

Extraction
Baseline
Verify
04/07

Experiments move through the board

ML Junior writes code, launches runs, records logs, parses metrics, and tracks costs.

exp_001
exp_008
exp_014
exp_015
05/07

Weak results are rejected

The Eval Auditor blocks invalid results instead of letting them become impressive-looking claims.

Rejected
Split integrity failed
06/07

Claims receive proof

Every final claim links to the exact run, dataset, code, config, logs, and verifier verdict.

Evidence Ledger
claim: +3.7 F1
run: exp_014
verifier: passed
07/07

Publish requires approval

The lab can prepare artifacts autonomously, but publishing remains a human-governed decision.

Approval Required
Publish model card?

Not a chatbot. A governed research system.

Comparison

Built for machine learning, not generic automation

Generic AI Agents | Proof Foundry

  • Chat transcript as state | Experiment ledger as truth
  • Vague progress updates | Flow phases and visible state
  • Unsupported metrics | Evidence-backed results
  • One-off task execution | Research programs and projects
  • No verifier | Proof Gate before claims
  • No compute governance | Budgets and approvals
  • Code-first | Hypothesis → experiment → artifact

Generic agents are optimized for coding tasks. Proof Foundry is optimized for ML research workflows where the hard part is not just generating code, but proving what happened.

Architecture

The lab stack

Proof Foundry is built as a research control plane around ML Junior. The system separates orchestration, execution, evidence, and verification.

Human Board

Approves budgets, compute, publishing, metrics, and final verified claims.

Proof Foundry Control Plane

Orchestrates research programs, flow templates, and agent coordination.

Research Programs + Flow Templates

Structured protocols for reproducible ML research workflows.

ML Junior Runtime

Executes Hugging Face-native ML work: papers, datasets, code, jobs, metrics, model cards.

Evidence Ledger

Tracks hypotheses, snapshots, runs, logs, metrics, artifacts, and costs.

Verifier Gate

Blocks unsupported claims and prevents accidental publishing.

Audience

For the people building serious ML systems

Indie researchers

Run ambitious research programs without losing context or reproducibility.

Frontier AI startups

Coordinate model experiments, benchmarks, and artifacts with governance from day one.

Applied ML teams

Turn repeated workflows into templates and keep experiments auditable.

Open-source model builders

Publish better model cards, reproducibility reports, and evidence-backed releases.

Research labs

Use autonomous agents while preserving human oversight, budgets, and verification.

Early Access

Join the founding waitlist

Proof Foundry is in development. We are inviting serious researchers, ML engineers, model builders, and frontier teams to shape the first human-governed autonomous ML lab.

EST. 2024
Founding Cohort
Limited early access

Early access is limited to builders working on serious ML research workflows.


The future of ML research is not more chaos. It is governed autonomy with proof.

Proof Foundry is building the lab we wanted to use ourselves: autonomous where useful, governed where necessary, and evidence-backed everywhere.

MISSION · EXPERIMENT · EVIDENCE · VERDICT