agent
a language model wired into a loop where it can act, not just talk. it
reads a task, decides on a step, runs a tool, sees the result, and repeats
until it is done. in otis the agent is a single otter running a
ReAct loop with two tools, a shell and a python
runner. each rollout is one agent attempting the
task once.

best-of-N
make N independent attempts at a task, then keep the best one. it is the
simplest form of test-time compute and the
most parallel. it only works if you can actually tell which attempt is
best, which is why otis pairs it with a real verifier.
the result is summarised by pass@k.
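
a minimal sketch of the selection step described above. the names `best_of_n`, `attempt` and `verify` are hypothetical, not otis's own code; `attempt` stands in for one rollout and `verify` for the real pass/fail check.

```python
from typing import Callable, List, Optional


def best_of_n(attempt: Callable[[int], str],
              verify: Callable[[str], bool],
              n: int) -> Optional[str]:
    """make n independent attempts and return the first that verifies."""
    # each attempt gets its own seed so the tries can differ
    candidates: List[str] = [attempt(seed) for seed in range(n)]
    for candidate in candidates:
        if verify(candidate):
            return candidate
    return None  # no attempt passed the check


# toy stand-ins: the "task" is to produce the string "9"
toy_attempt = lambda seed: str(seed * seed)
toy_verify = lambda answer: answer == "9"
winner = best_of_n(toy_attempt, toy_verify, n=4)
```

the attempts are run in sequence here for brevity; nothing in the selection logic changes if they run in parallel.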

data parallelism
running a full copy of the model on each gpu, so several attempts run at
once, one per card. the opposite of tensor
parallelism. otis uses data parallelism: many small otters instead of
one big one. the trade is that each replica is only
as smart as a model that fits on a single card.

kv cache
the memory a model keeps about the tokens it has already read, so it does
not recompute them on every new token. it grows with the length of the
conversation and eats gpu memory alongside the model weights. if otis runs
out of memory, shrinking the context length shrinks the kv cache. in otis
the cache is managed by vllm.
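
a rough size estimate shows why context length matters. this is the standard back-of-envelope formula for a transformer's kv cache; the function name and the model shape in the example are illustrative assumptions, not the model otis runs.

```python
def kv_cache_bytes(num_layers: int,
                   num_kv_heads: int,
                   head_dim: int,
                   seq_len: int,
                   batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """approximate kv cache size in bytes.

    the leading factor of 2 counts one key and one value tensor per layer;
    bytes_per_value=2 assumes 16-bit storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# illustrative shape only: 32 layers, 8 kv heads, head size 128
full = kv_cache_bytes(32, 8, 128, seq_len=4096)
half = kv_cache_bytes(32, 8, 128, seq_len=2048)
```

with these assumed numbers a 4096-token context costs 512 MiB per sequence, and halving the context halves the cache.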

otis
an otter, and the experiment named after him. otis measures how the success
rate on a checked task rises as you spend more compute by taking more
parallel rollouts across replicas
and keeping the ones that pass. see the about page for the
full picture.

pass@k
the probability that at least one of k attempts succeeds. given N
rollouts of which c passed, otis estimates it with
the unbiased HumanEval formula
1 − C(N−c, k) / C(N, k). plotted against k it is
the scaling curve on the method page. because
otis verifies for real, pass@k doubles as an honest success rate, not just
an oracle's best case.
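
the estimator above fits in a few lines. this is a sketch of the formula as written, using python's `math.comb`; the function name `pass_at_k` is an assumption, not necessarily what otis calls it.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k).

    n: total rollouts, c: rollouts that passed, k: attempts considered.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k rollouts failed,
    # so the estimate is exactly 1.0 in that case
    return 1.0 - comb(n - c, k) / comb(n, k)
```

for example, with 10 rollouts of which 3 passed, pass@1 comes out at 0.3 and pass@5 at roughly 0.917.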

ReAct
a loop pattern for agents: reason, then act, then read the
result, and repeat. the model writes a short thought, emits a tool call,
otis runs it and returns the output, and the cycle continues until the
model declares a final answer. it is the loop every otis
agent runs.
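
the cycle can be sketched with a stubbed model. everything here is hypothetical scaffolding (the `Step` tuple, `react_loop`, the scripted model and the fake shell tool); it shows the shape of the loop, not otis's actual implementation, and it leaves out the written thought for brevity.

```python
from typing import Callable, Dict, List, Optional, Tuple

# hypothetical step format: ("tool", tool_name, argument) or ("final", answer, "")
Step = Tuple[str, str, str]


def react_loop(model: Callable[[List[str]], Step],
               tools: Dict[str, Callable[[str], str]],
               task: str,
               max_steps: int = 8) -> Optional[str]:
    """act, observe, repeat until the model gives a final answer."""
    transcript = [f"task: {task}"]
    for _ in range(max_steps):
        kind, name, arg = model(transcript)
        if kind == "final":
            return name  # for a final step, the second field is the answer
        observation = tools[name](arg)  # run the requested tool
        transcript.append(f"action: {name}({arg!r})")
        transcript.append(f"observation: {observation}")
    return None  # step budget exhausted without a final answer


def scripted_model(transcript: List[str]) -> Step:
    """stand-in for a language model: one tool call, then answer."""
    if len(transcript) == 1:
        return ("tool", "shell", "ls")
    return ("final", transcript[-1].split(": ", 1)[1], "")


fake_tools = {"shell": lambda cmd: "ran " + cmd}
answer = react_loop(scripted_model, fake_tools, "list the files")
```

the step budget is a design choice: without it a model that never declares a final answer would loop forever.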

replica
one complete copy of the model running on one gpu, with its own server and
port. four cards, four replicas, four otters fishing in parallel. replicas
are how otis turns more gpus into more attempts; see
data parallelism and the
gpus page.
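
one way to picture "more gpus into more attempts" is a round-robin assignment of rollouts to replica addresses. the host, the base port of 8000 and the helper names are assumptions for illustration; the source only says each replica has its own port.

```python
from itertools import cycle
from typing import List


def replica_urls(num_gpus: int,
                 base_port: int = 8000,
                 host: str = "localhost") -> List[str]:
    """one openai-style base url per replica, one replica per gpu."""
    return [f"http://{host}:{base_port + i}/v1" for i in range(num_gpus)]


def assign_rollouts(num_rollouts: int, urls: List[str]) -> List[str]:
    """hand out rollouts to replicas in round-robin order."""
    pool = cycle(urls)
    return [next(pool) for _ in range(num_rollouts)]


urls = replica_urls(4)
plan = assign_rollouts(6, urls)  # rollouts 4 and 5 wrap around to the first two replicas
```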

rollout
a single end-to-end attempt at a task by one agent,
from the first thought to the final check. one experiment is many rollouts
run in parallel. each gets its own sandbox and is
graded independently by the verifier.

runpod
a cloud provider that rents gpu machines by the minute, including
multi-card boxes. otis targets runpod pods because the experiment needs
real nvidia hardware that an ordinary laptop or a cpu web host cannot
provide. you start one replica per card and point
the runner at them.

sandbox
a private temporary directory given to each rollout,
seeded with the task's starting files. the agent's shell and python tools
can only touch this folder, so parallel otters never overwrite each other's
work and the verifier grades a clean, isolated
result.
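
a minimal sketch of seeding a sandbox, assuming the task's starting files live in a directory. `make_sandbox` is a hypothetical helper. note that it only creates and fills the folder; actually confining the tools to it needs extra enforcement that is not shown here.

```python
import shutil
import tempfile
from pathlib import Path


def make_sandbox(task_files: Path) -> Path:
    """create a fresh temporary directory and copy the task's files into it."""
    sandbox = Path(tempfile.mkdtemp(prefix="rollout-"))
    # dirs_exist_ok lets us copy into the directory mkdtemp just created
    shutil.copytree(task_files, sandbox, dirs_exist_ok=True)
    return sandbox
```

calling it once per rollout gives every attempt its own copy, so a write in one sandbox never shows up in another or in the original task files.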

temperature
the dial on how random the model's choices are. low temperature makes it
pick the safe, likely token every time; high temperature lets it wander.
otis samples rollouts at 0.8 so the attempts actually diverge —
best-of-N is pointless if every try is identical.
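
the dial works by dividing the model's raw scores (logits) by the temperature before turning them into probabilities. a small sketch with made-up logits; the function name is an assumption.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """convert logits to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)  # nearly always the top token
warm = softmax_with_temperature(logits, 0.8)  # the setting otis samples at
```

at 0.2 almost all the probability sits on the first token; at 0.8 the other two keep a real chance of being picked, which is what lets parallel attempts diverge.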

tensor parallelism
splitting a single model's weights across several gpus so a model too big
for one card can still run, the cards cooperating on each token. it buys
one smarter attempt rather than many parallel ones, which is why otis
prefers data parallelism instead. comparing
the two at a fixed gpu budget is the experiment otis is built to grow into.

test-time compute
spending more computation at the moment you ask a question, rather than
training a bigger model ahead of time. it comes in two flavours: thinking
longer (one long chain of reasoning) and thinking wider (many parallel
attempts). otis studies the wide kind, because it scales cleanly across
replicas. this is the trend the whole project pokes
at.

throughput and gpu-seconds
throughput is how many tokens the gpus produce per second; gpu-seconds is
cards multiplied by wall-clock, the honest unit of "how much compute did
this cost." otis logs tokens and time per rollout so
success can be plotted against real gpu-seconds, not just against the count
of tries.
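
the two units as arithmetic. the function names and the numbers in the example are illustrative, not measurements from otis.

```python
def gpu_seconds(num_gpus: int, wall_clock_s: float) -> float:
    """cards multiplied by wall-clock time."""
    return num_gpus * wall_clock_s


def throughput(tokens: int, wall_clock_s: float) -> float:
    """tokens produced per second of wall-clock time."""
    return tokens / wall_clock_s


# illustrative run: 4 cards busy for 90 seconds, producing 18000 tokens in total
cost = gpu_seconds(4, 90.0)
rate = throughput(18000, 90.0)
```

here the run costs 360 gpu-seconds at 200 tokens per second, whether those 90 seconds held four rollouts or forty.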

verifier
a shell command attached to a task that runs against a finished
sandbox and exits zero only on success: run the
tests, compare the output, check the file. it is what makes
best-of-N honest — selection by real
pass/fail rather than the model's own opinion of itself.
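
a sketch of running such a check with python's `subprocess.run`. the `verify` helper and the timeout are assumptions; the example command relies on a posix shell.

```python
import subprocess
from pathlib import Path


def verify(sandbox: Path, command: str, timeout_s: float = 60.0) -> bool:
    """run the task's check inside the sandbox; pass means exit code zero."""
    try:
        result = subprocess.run(
            command,
            shell=True,           # the verifier is a shell command
            cwd=sandbox,          # run it against the finished sandbox
            capture_output=True,  # keep its output out of our own logs
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # a check that hangs counts as a failure
    return result.returncode == 0
```

for instance, `verify(sandbox, "test -f answer.txt")` passes only if the rollout left an `answer.txt` behind.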

vllm
a fast inference server for language models. otis runs one vllm process per
gpu (a replica), each exposing an openai-style api on
its own port. it handles batching, the kv cache,
and serving the weights so the agent code can stay simple.