method

how otis runs an experiment

task in, curve out.

the shape of a run

one run takes a single task and a number of tries, N. otis copies the task into N private sandboxes, sends a fresh otter (one agent instance) at each one, spread across the gpu replicas, checks every result, and reports how many passed.

   task ──► N sandboxes ──► N otters (parallel) ──► verify each ──► curve
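
the pipeline above can be sketched in a few lines. this is a minimal illustration, not otis's actual code: run_experiment, run_agent and check_cmd are hypothetical names, and a thread pool stands in for whatever really fans the tries out across the gpu replicas.

```python
import shutil
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_experiment(task_dir, n_tries, run_agent, check_cmd):
    """copy the task n times, run one agent per copy, verify each, count passes.

    run_agent(sandbox) is a hypothetical callable standing in for one otter;
    check_cmd is the task's check as an argv list. both are assumptions.
    """
    root = Path(tempfile.mkdtemp(prefix="otis-run-"))

    def one_try(i):
        sandbox = root / f"try-{i:03d}"
        shutil.copytree(task_dir, sandbox)      # private copy for this try
        run_agent(sandbox)                      # the otter works only in here
        result = subprocess.run(check_cmd, cwd=sandbox)
        return result.returncode == 0           # exit status zero means pass

    # threads are enough for a sketch: each try mostly waits on other processes
    with ThreadPoolExecutor(max_workers=n_tries) as pool:
        passed = list(pool.map(one_try, range(n_tries)))
    return sum(passed), n_tries
```

the return value is the pair (c, N) that the pass@k formula further down needs.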

step 1 — the task

a task is a folder. it holds a prompt, an optional starting workspace (files the agent begins with), and a check: a shell command that exits with status zero only on success.

example — "fix the bug in mathutils.py so the tests pass." check: python3 test_mathutils.py. exit zero means fixed.
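
as a sketch, a task folder could be loaded and checked like this. the file names (prompt.txt, workspace/, check.txt) and the Task class are assumptions made for illustration; the text above only says a task has a prompt, an optional workspace, and a check.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Task:
    prompt: str
    workspace: Optional[Path]   # optional starting files
    check: str                  # shell command; exit status 0 means success


def load_task(folder):
    """read a task folder. the file names are hypothetical, not otis's layout."""
    folder = Path(folder)
    workspace = folder / "workspace"
    return Task(
        prompt=(folder / "prompt.txt").read_text(),
        workspace=workspace if workspace.is_dir() else None,
        check=(folder / "check.txt").read_text().strip(),
    )


def passes(task, sandbox):
    """run the check inside a sandbox; only exit status 0 counts as a pass."""
    result = subprocess.run(task.check, shell=True, cwd=sandbox)
    return result.returncode == 0
```

for the example above, check.txt would hold the single line python3 test_mathutils.py.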

step 2 — the agent loop

each try is one agent running a ReAct loop: the model thinks, then either calls a tool or declares it is finished. otis runs the tool, feeds the output back, and the loop continues until the agent stops or hits a step limit.

   ┌─────────────────────────────────────┐
   │  think  →  call a tool  →  see result │
   └──────────────▲─────────────┬─────────┘
                  └─────────────┘   until done

the tools are a shell and a python runner, both confined to that try's sandbox so the parallel otters never touch each other's files.
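
a minimal sketch of that loop, under stated assumptions: model is a hypothetical callable that takes the history and returns a dict with a thought, an action and an input. only the shell tool is shown; the python runner would slot in the same way. setting the working directory is not real isolation, it only marks where the confinement hooks in.

```python
import subprocess


def run_shell(command, sandbox):
    """shell tool: run a command with the sandbox as working directory.

    cwd alone does not confine anything; a real harness would need a
    container or similar. this is an illustration only.
    """
    done = subprocess.run(command, shell=True, cwd=sandbox,
                          capture_output=True, text=True, timeout=60)
    return done.stdout + done.stderr


def agent_loop(model, prompt, sandbox, max_steps=20):
    """think, call a tool, see the result; stop on finish or at the step limit.

    model(history) is hypothetical and returns a dict shaped like
    {"thought": str, "action": "shell" or "finish", "input": str}.
    """
    history = [("task", prompt)]
    for _ in range(max_steps):
        move = model(history)
        history.append(("thought", move["thought"]))
        if move["action"] == "finish":
            return history, "finished"
        observation = run_shell(move["input"], sandbox)
        history.append(("observation", observation))   # fed back next turn
    return history, "step_limit"
```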

step 3 — temperature matters

the tries are sampled at temperature 0.8, on purpose. taking many shots only helps if the shots differ. at low temperature every otter makes nearly the same moves and the whole experiment collapses to one attempt repeated. divergence is the point.
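
a toy illustration of why temperature changes how much the tries differ. this is plain softmax sampling over four made-up scores, not a model and not how otis sets temperature; the function names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """turn scores into probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)                           # subtract the max for stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distinct_choices(logits, temperature, n_samples, seed=0):
    """sample n times and count how many different options were picked."""
    rng = random.Random(seed)
    probs = softmax_with_temperature(logits, temperature)
    picks = rng.choices(range(len(logits)), weights=probs, k=n_samples)
    return len(set(picks))


logits = [2.0, 1.5, 1.0, 0.5]                   # made-up scores for four moves
cold = softmax_with_temperature(logits, 0.1)
warm = softmax_with_temperature(logits, 0.8)
print(f"top move probability at T=0.1: {cold[0]:.3f}")
print(f"top move probability at T=0.8: {warm[0]:.3f}")
```

at the low temperature nearly all the probability sits on one move, so repeated samples repeat it. at 0.8 the other moves keep a real share, which is what lets parallel tries diverge.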

step 4 — verify, then count

when an otter finishes, otis runs the check against its sandbox. pass or fail, recorded honestly. with N tries and c of them passing, the headline metric is pass@k: the chance that k tries drawn at random from the N contain at least one winner.

                      C(N - c, k)
   pass@k  =  1  -  ───────────────
                       C(N, k)

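this is the unbiased estimator from the HumanEval paper. when fewer than k tries failed (N - c < k), C(N - c, k) is zero and pass@k is exactly 1. a single batch of N tries is enough to describe the whole curve, for every k from 1 up to N.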
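
the formula translates directly into python. a minimal sketch using exact integer binomials from the standard library; the function name pass_at_k is an assumption, not necessarily what otis calls it.

```python
from math import comb


def pass_at_k(n, c, k):
    """unbiased pass@k from n tries with c passes: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1 there:
    # any k tries drawn from the batch must include a winner.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

for example, with 10 tries and 3 passes, pass_at_k(10, 3, 1) is the plain pass rate 0.3, and pass_at_k(10, 3, 8) is 1.0 because only 7 tries failed.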

the curve

plot pass@k against k and you see the experiment's whole point: each extra try is extra compute, and the line shows what that compute buys.

   pass@k
    1.0 │                  ● ─── ● ─── ●
        │            ●
    0.8 │
        │       ●
    0.6 │
        │   ●
    0.4 │
        │ ●
    0.2 │
        └────┬────┬────┬────┬────┬────┬──► k
             1    2    4    8   16   32

one try lands ~40% of the time here; eight tries land almost always. the gap is the whole thesis: cheap reliability bought with parallel compute, on a model that was never that reliable alone.
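
a back-of-envelope check of that claim, assuming the tries are independent and each passes with the same probability p. that is an idealisation; the estimator above works from measured counts instead.

```python
def at_least_one(p, k):
    """chance that k independent tries, each passing with probability p,
    include at least one pass."""
    return 1.0 - (1.0 - p) ** k


for k in (1, 2, 4, 8, 16, 32):
    print(f"k={k:>2}  pass@k ~ {at_least_one(0.4, k):.3f}")
```

at p = 0.4 this gives about 0.64 for two tries and about 0.98 for eight, which is the "almost always" in the paragraph above.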

reading compute honestly

k is a convenient x-axis, but tries are not all the same size. some otters flail for many steps; some solve it in two. so otis also logs the completion tokens and wall-clock time of every try. that lets you replot success against real gpu-seconds, which is the honest "what did the compute cost" chart.
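
one way to build that chart, as a sketch. the TryLog record and cost_curve function are hypothetical, and the default treats a try's wall-clock time as its gpu-seconds, which only holds if each try occupies one gpu for its whole duration.

```python
from dataclasses import dataclass
from math import comb


@dataclass
class TryLog:
    passed: bool
    completion_tokens: int
    wall_seconds: float


def cost_curve(logs, ks, cost=lambda log: log.wall_seconds):
    """pair each k with the average cost of k tries and the pass@k they buy.

    cost picks the per-try cost; swap in completion_tokens to plot
    against tokens instead of seconds.
    """
    n = len(logs)
    c = sum(log.passed for log in logs)
    mean_cost = sum(cost(log) for log in logs) / n
    points = []
    for k in ks:
        pass_k = 1.0 - comb(n - c, k) / comb(n, k)   # same estimator as above
        points.append((k * mean_cost, pass_k))
    return points
```

the x value is an average: k tries drawn at random cost k times the mean try, even though individual tries vary a lot.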

the catch: calibration

the curve only says anything when one try succeeds sometimes — roughly 10 to 40 percent of the time. too easy and every k passes, a flat line at the top. too hard and none do, a flat line at the bottom. picking tasks that land in that band, for a given model, is the real work of the experiment. the plumbing is easy; the calibration is the science.
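
the band test itself is tiny. a sketch, with the 10 to 40 percent band taken from the paragraph above as default thresholds; the function name and labels are assumptions.

```python
def calibration(n, c, low=0.10, high=0.40):
    """label a task by its single-try pass rate c / n against the target band."""
    rate = c / n
    if rate < low:
        return "too hard"
    if rate > high:
        return "too easy"
    return "in band"
```

with 20 tries, 5 passes gives a rate of 0.25 and lands in the band; 1 pass is too hard and 15 passes is too easy. the label is per model: the same task can move bands when the model changes.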

the starter tasks

task        what it asks                          difficulty
fizzbuzz    write a script with the right output  easy floor
csv-sum     sum a column under a condition        medium
fix-bug     repair code until tests pass          medium
word-count  parse text, find the top word         medium-hard

they span the range on purpose. run each, keep the ones that land in the band for your model, treat the rest as calibration data.