Autoresearch implementation for testing

Andrej Mickov
2026-03-24 10:37:51 +01:00
parent 4a435bf13d
commit a0086da16b
12 changed files with 1874 additions and 330 deletions

program.md (new file, 115 lines)
# autoresearch
This is an experiment to have the LLM do its own research.
## Setup
To set up a new experiment, work with the user to:
1. **Agree on a run tag**: propose a tag based on today's date (e.g. `mar24`). The branch `autoresearch/<tag>` must not already exist — this is a fresh run.
2. **Create the branch**: `git checkout -b autoresearch/<tag>` from current master.
3. **Read the in-scope files**: The repo is small. Read these files for full context:
- `README.md` — repository context.
- `prepare.py` — fixed runtime utilities, summary extraction, and dataset checks. Do not modify.
- `train.py` — the file you modify. Model choice, optimizer, hyperparameters, image size, and training loop entrypoint all live here.
4. **Verify data exists**: Check that `ships-aerial-images/data.yaml` exists, or that `YOLO_DATA` points to a valid dataset YAML. If not, tell the human to add the dataset first.
5. **Initialize results.tsv**: Create `results.tsv` with just the header row. The baseline will be recorded after the first run.
6. **Confirm and go**: Confirm setup looks good.
Once you get confirmation, kick off the experimentation.
## Experimentation
Each experiment runs through `uv run train.py`.
The training script uses a **fixed 5-minute time budget** through Ultralytics' `time` argument, so experiments are approximately comparable and always short enough to iterate quickly.
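For orientation, the budget is enforced inside `train.py`; a rough sketch of how the Ultralytics `time` argument (measured in hours) caps a run might look like the following. The model weights and image size here are illustrative assumptions, not the repo's actual choices:

```python
from ultralytics import YOLO

# Sketch only -- the real call lives in train.py. The weights file and
# imgsz below are placeholders, not what the repo necessarily uses.
model = YOLO("yolo11n.pt")
model.train(
    data="ships-aerial-images/data.yaml",
    time=5 / 60,  # Ultralytics `time` is in hours: 5-minute budget
    imgsz=640,
)
```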
**What you CAN do:**
- Modify `train.py` — this is the only file you edit. Everything there is fair game: model size, model weights, image size, batch size, optimizer, learning rate schedule, augmentation knobs, worker count, freeze settings, and similar training parameters.
**What you CANNOT do:**
- Modify `prepare.py`. It is read-only.
- Install new packages or add dependencies. You can only use what's already in `pyproject.toml`.
- Modify the evaluation harness. Use only the standard Ultralytics validation outputs produced by the training run.
**The goal is simple: get the highest `metrics/mAP50-95(B)`.** Higher is better. Since the time budget is fixed, the core job is to find the best-performing experiment under that fixed budget.
**VRAM** is a soft constraint. Some increase is acceptable for meaningful gains, but avoid ideas that blow up memory or make experiments fragile.
**Simplicity criterion**: All else being equal, simpler is better. A tiny gain that adds ugly complexity is usually not worth it. Removing complexity while keeping equal or better quality is a win.
**The first run**: Your very first run should always be the baseline, so run the training script as is before changing anything.
## Output format
Once the script finishes, it prints a summary like this:
```
---
fitness_key: metrics/mAP50-95(B)
fitness: 0.612345
training_seconds: 300.1
total_seconds: 300.1
peak_vram_mb: 8240.5
precision: 0.801234
recall: 0.745678
map50: 0.822222
map50_95: 0.612345
epoch: 18
```
You can extract the key metrics from the log file with:
```
grep "^fitness:\|^peak_vram_mb:" run.log
```
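If you want structured values rather than raw grep output, a small parser over the summary lines works too. This is a hypothetical helper, not part of `prepare.py`; it assumes the summary is the text after the final `---` marker:

```python
def parse_summary(log_text: str) -> dict:
    """Parse the `key: value` summary that follows the final `---` in run.log."""
    tail = log_text.rsplit("---", 1)[-1]
    out = {}
    for line in tail.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        try:
            out[key] = float(value)
        except ValueError:
            out[key] = value  # non-numeric values (e.g. fitness_key) stay strings
    return out
```

An empty dict (or a missing `fitness` key) then signals a crashed run, mirroring the empty-grep check in the loop below.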
## Logging results
When an experiment is done, log it to `results.tsv` (tab-separated, NOT comma-separated — commas break descriptions).
The TSV has a header row and 5 columns:
```
commit metric memory_gb status description
```
1. git commit hash (short, 7 chars)
2. `metrics/mAP50-95(B)` achieved (e.g. 0.612345) — use `0.000000` for crashes
3. peak memory in GB, round to `.1f` (divide `peak_vram_mb` by 1024) — use `0.0` for crashes
4. status: `keep`, `discard`, or `crash`
5. short text description of what the experiment tried
Example:
```
commit metric memory_gb status description
a1b2c3d 0.612345 8.1 keep baseline yolo11l 640 adamw
b2c3d4e 0.618901 9.4 keep increase image size to 768
c3d4e5f 0.605100 7.9 discard reduce batch and switch optimizer
d4e5f6a	0.000000	0.0	crash	batch too large caused OOM
```
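The rounding and separator rules above can be captured in a small helper (hypothetical, not in the repo):

```python
def tsv_row(commit: str, fitness: float, peak_vram_mb: float,
            status: str, description: str) -> str:
    """Format one results.tsv line: tab-separated, memory in GB to one decimal."""
    assert status in {"keep", "discard", "crash"}
    memory_gb = peak_vram_mb / 1024  # MB -> GB, rounded by the .1f format
    return "\t".join([commit, f"{fitness:.6f}", f"{memory_gb:.1f}",
                      status, description])
```

Tabs (not commas) keep free-text descriptions safe as the fifth column.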
## The experiment loop
The experiment runs on a dedicated branch (e.g. `autoresearch/mar24`).
LOOP FOREVER:
1. Look at the git state: the current branch and commit.
2. Tune `train.py` with one experimental idea.
3. git commit
4. Run the experiment: `uv run train.py > run.log 2>&1`
5. Read out the results: `grep "^fitness:\|^peak_vram_mb:" run.log`
6. If the grep output is empty, the run crashed. Read the traceback from `run.log`, attempt a fix if it is easy, otherwise mark it as a crash and move on.
7. Record the result in `results.tsv` (do not commit `results.tsv`; leave it untracked).
8. If the metric improved, keep the commit.
9. If the metric is equal or worse, reset the branch back to the commit you started from.
The idea is that you are a completely autonomous researcher trying things out. If they work, keep. If they don't, discard. Advance the branch only with improvements.
**Timeout**: Each experiment should take about 5 minutes total, plus a small amount of overhead. If a run exceeds 10 minutes, kill it and treat it as a failure.
**Crashes**: If a run crashes (OOM, bad hyperparameters, a typo, etc.), use judgment. If it is something dumb and easy to fix, fix it and re-run. If the idea is fundamentally broken, log it as `crash` and move on.
**NEVER STOP**: Once the experiment loop has begun, do not pause to ask whether you should continue. Keep going until the human interrupts you.