Proof Foundry / Autonomous ML Lab

The human-governed autonomous ML lab

Proof Foundry helps teams conduct long-running machine learning research through autonomous agents, experiment ledgers, verifier gates, budgets, approvals, and durable evidence for every model, metric, and claim.

Built on ML Junior. Designed for researchers, model builders, frontier startups, and serious open-source labs.

“Every claim must survive contact with evidence.”

Orchestrated for long-running research. Designed around evidence, budgets, approvals, and verifier gates.

ML Junior runtime · Hugging Face-native workflows · Paperclip-style governance · Experiment ledger · Human approval gates

Most AI agents generate output. Proof Foundry generates research you can trust.

Long-running ML work collapses when the only source of truth is a chat transcript. Experiments disappear into logs. Metrics lose provenance. Agents forget what they tried. Humans return hours later and cannot tell what is real.

Proof Foundry is built for the actual ML research loop: hypotheses, data, baselines, experiments, failures, artifacts, and verification.

The problem

  • Chat history becomes state
  • Metrics are detached from runs
  • Failed experiments are forgotten
  • Long projects become illegible

What research needs

  • Durable workflow state
  • Experiment ledgers
  • Dataset/code snapshots
  • Evidence-backed claims
  • Human approval and budget controls

What Proof Foundry adds

  • Flow templates
  • Research agents
  • Verifier gates
  • Evidence ledger
  • Lab dashboard
System Architecture

A new operating system for ML research

Proof Foundry combines autonomous agents, reproducible ML workflows, experiment ledgers, and human governance into one research control plane.

Research Programs

Organize long-running directions into missions, projects, and flow templates.

Autonomous ML Workers

Specialized agents read papers, inspect datasets, implement code, launch experiments, and prepare reports.

Experiment Ledger

Every run records code, data, config, metrics, logs, artifacts, cost, and verifier verdict.

Proof Gate

Final claims must link to evidence before they are accepted, published, or reported.
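As a sketch of what one Experiment Ledger entry described above might contain. This is illustrative only: the field names and types here are assumptions, not Proof Foundry's actual schema.

```python
from dataclasses import dataclass

# Hypothetical shape of one Experiment Ledger entry; every field name
# below is an assumption drawn from the prose, not a real API.
@dataclass
class ExperimentRun:
    run_id: str
    code_snapshot: str       # e.g. a git commit hash
    data_snapshot: str       # dataset revision or content hash
    config: dict             # hyperparameters and launch settings
    metrics: dict            # parsed results, e.g. {"f1": 0.912}
    log_uri: str             # where the raw logs live
    artifacts: list          # produced files (checkpoints, reports)
    cost_usd: float          # compute spend attributed to this run
    verifier_verdict: str = "pending"   # "passed" / "rejected" / "pending"

run = ExperimentRun(
    run_id="exp_014",
    code_snapshot="a1b2c3d",
    data_snapshot="glue@2024.1",
    config={"lr": 2e-5, "epochs": 3},
    metrics={"f1": 0.912},
    log_uri="s3://lab/logs/exp_014",
    artifacts=["model.safetensors", "report.md"],
    cost_usd=4.20,
)
print(run.run_id, run.verifier_verdict)  # → exp_014 pending
```

A run starts with a "pending" verdict; only the verifier flips it, which is what lets a claim later point at a single trustworthy record.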

Research Flow

Mission
Research Program
Project
Flow Template
Experiment Run
Evidence Ledger
Verifier Gate
Publish
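The flow above is a strictly ordered pipeline. A minimal sketch, assuming each stage must complete before the next begins (the phase names come from the diagram; the state machine is our assumption):

```python
# Ordered phases of the research flow, as named in the diagram above.
PHASES = [
    "Mission",
    "Research Program",
    "Project",
    "Flow Template",
    "Experiment Run",
    "Evidence Ledger",
    "Verifier Gate",
    "Publish",
]

def next_phase(current: str):
    """Return the phase that follows `current`, or None after Publish."""
    i = PHASES.index(current)
    return PHASES[i + 1] if i + 1 < len(PHASES) else None

print(next_phase("Verifier Gate"))  # → Publish
```

Note that "Publish" is reachable only through "Verifier Gate", which is the point of the design: there is no path to publication that skips verification.
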
Our Principles

Manifesto

1

Research is not chat.

It is a sequence of hypotheses, experiments, failures, and evidence.

2

Autonomy requires governance.

Powerful agents need budgets, policies, approvals, and oversight.

3

Metrics without provenance are noise.

Every result must link back to code, data, config, and logs.

4

Long-running work must remain legible.

A researcher should return after hours or days and instantly understand the state of a project.

5

Publishing is a gate, not a side effect.

Nothing important gets shipped without verification.

6

The lab is the product.

Not just the agent. The system around the agent.

Process

From idea to verified result

1

Define the mission

Set objective, constraints, budget, and approval rules.

2

Launch a research program

Turn broad research goals into projects and workstreams.

3

Select a flow template

Use structured protocols for paper reproduction, fine-tuning, ablations, dataset audits, or benchmark creation.

4

Run experiments

Agents inspect data, write code, launch jobs, parse logs, and track metrics.

5

Track evidence

Every run links to dataset snapshots, code snapshots, configs, logs, artifacts, and costs.

6

Verify claims

The verifier checks metric provenance, baseline fairness, split integrity, and artifact availability.

7

Publish with confidence

Reports, model cards, and artifacts are released only after human approval.
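Step 6 above names four checks: metric provenance, baseline fairness, split integrity, and artifact availability. A hedged sketch of how such a gate could work; the record shape, rules, and return convention here are all assumptions, not Proof Foundry's implementation:

```python
# Illustrative Proof Gate check. The dict keys and pass/fail rules are
# hypothetical; only the four check categories come from the text.
def verify_claim(run: dict):
    problems = []
    # Metric provenance: every reported metric must trace back to logs.
    if not run.get("log_uri"):
        problems.append("metric provenance: no logs linked")
    # Baseline fairness: baseline must use the same evaluation config.
    if run.get("baseline_config") != run.get("eval_config"):
        problems.append("baseline fairness: configs differ")
    # Split integrity: train and test sets must not overlap.
    if run.get("train_test_overlap", 0) > 0:
        problems.append("split integrity: train/test overlap detected")
    # Artifact availability: claimed artifacts must actually exist.
    if not run.get("artifacts"):
        problems.append("artifact availability: no artifacts attached")
    return (len(problems) == 0, problems)

ok, why = verify_claim({
    "log_uri": "s3://lab/logs/exp_014",
    "baseline_config": {"split": "test"},
    "eval_config": {"split": "test"},
    "train_test_overlap": 0,
    "artifacts": ["model.safetensors"],
})
print(ok)  # → True for this run
```

Returning the list of problems, not just a boolean, matters: a rejected run tells the researcher which check failed, as in the "Split integrity failed" rejection shown in the demo below.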

Demo

Watch a lab come alive

A single request becomes a governed research workflow: agents are assigned, phases appear, experiments run, metrics are verified, and final claims are backed by proof.

User prompt

"Reproduce this paper and see if we can improve the baseline by 2%."

01/07

Lab created

Proof Foundry creates a research project, selects the Reproduce Paper flow, and assigns specialized roles.

Research Program
02/07

The paper becomes a protocol

The Literature Scientist extracts the method, target metrics, datasets, and reproducibility risks.

Method Extraction
Target: F1 score
Dataset: GLUE
03/07

The flow becomes visible

Long-running work is no longer hidden in chat. Every phase has status, artifacts, and blockers.

Extraction
Baseline
Verify
04/07

Experiments move through the board

ML Junior writes code, launches runs, records logs, parses metrics, and tracks costs.

exp_001
exp_008
exp_014
exp_015
05/07

Weak results are rejected

The Eval Auditor blocks invalid results instead of letting them become impressive-looking claims.

Rejected
Split integrity failed
06/07

Claims receive proof

Every final claim links to the exact run, dataset, code, config, logs, and verifier verdict.

Evidence Ledger
claim: +3.7 F1
run: exp_014
verifier: passed
07/07

Publish requires approval

The lab can prepare artifacts autonomously, but publishing remains a human-governed decision.

Approval Required
Publish model card?

Not a chatbot. A governed research system.

Comparison

Built for machine learning, not generic automation

Generic AI Agents | Proof Foundry

  • Chat transcript as state | Experiment ledger as truth
  • Vague progress updates | Flow phases and visible state
  • Unsupported metrics | Evidence-backed results
  • One-off task execution | Research programs and projects
  • No verifier | Proof Gate before claims
  • No compute governance | Budgets and approvals
  • Code-first | Hypothesis → experiment → artifact

Generic agents are optimized for coding tasks. Proof Foundry is optimized for ML research workflows where the hard part is not just generating code, but proving what happened.

Architecture

The lab stack

Proof Foundry is built as a research control plane around ML Junior. The system separates orchestration, execution, evidence, and verification.

Human Board

Approves budgets, compute, publishing, metrics, and final verified claims.

Proof Foundry Control Plane

Orchestrates research programs, flow templates, and agent coordination.

Research Programs + Flow Templates

Structured protocols for reproducible ML research workflows.

ML Junior Runtime

Executes Hugging Face-native ML work: papers, datasets, code, jobs, metrics, model cards.

Evidence Ledger

Tracks hypotheses, snapshots, runs, logs, metrics, artifacts, and costs.

Verifier Gate

Blocks unsupported claims and prevents accidental publishing.

Audience

For the people building serious ML systems

Indie researchers

Run ambitious research programs without losing context or reproducibility.

Frontier AI startups

Coordinate model experiments, benchmarks, and artifacts with governance from day one.

Applied ML teams

Turn repeated workflows into templates and keep experiments auditable.

Open-source model builders

Publish better model cards, reproducibility reports, and evidence-backed releases.

Research labs

Use autonomous agents while preserving human oversight, budgets, and verification.

Early Access

Join the founding waitlist

Proof Foundry is in development. We are inviting serious researchers, ML engineers, model builders, and frontier teams to shape the first human-governed autonomous ML lab.

EST. 2024
Founding Cohort
Limited early access

Early access is limited to builders working on serious ML research workflows.


The future of ML research is not more chaos. It is governed autonomy with proof.

Proof Foundry is building the lab we wanted to use ourselves: autonomous where useful, governed where necessary, and evidence-backed everywhere.

MISSION · EXPERIMENT · EVIDENCE · VERDICT