wiki

the otis wiki

every word on this site, defined once, in one place.

agent · best-of-N · data parallelism · kv cache · otis · pass@k · ReAct · replica · rollout · runpod · sandbox · temperature · tensor parallelism · test-time compute · throughput · verifier · vllm

agent #

a language model wired into a loop where it can act, not just talk. it reads a task, decides on a step, runs a tool, sees the result, and repeats until it is done. in otis the agent is a single otter running a ReAct loop with two tools, a shell and a python runner. each rollout is one agent attempting the task once.

best-of-N #

make N independent attempts at a task, then keep the best one. it is the simplest form of test-time compute and the most parallel. it only works if you can actually tell which attempt is best, which is why otis pairs it with a real verifier. the result is summarised by pass@k.

data parallelism #

running a full copy of the model on each gpu, so several attempts run at once, one per card. the opposite of tensor parallelism. otis uses data parallelism: many small otters instead of one big one. the trade is that each replica is only as smart as a model that fits on a single card.

kv cache #

the memory a model keeps about the tokens it has already read, so it does not recompute them on every new token. it grows with the length of the conversation and eats gpu memory alongside the model weights. if otis runs out of memory, shrinking the context length shrinks the kv cache. served by vllm.

otis #

an otter, and the experiment named after him. otis measures how the success rate on a checked task rises as you spend more compute by taking more parallel rollouts across replicas and keeping the ones that pass. see about for the full picture.

pass@k #

the probability that at least one of k attempts succeeds. given N rollouts of which c passed, otis estimates it with the unbiased HumanEval formula 1 − C(N−c, k) / C(N, k). plotted against k it is the scaling curve on the method page. because otis verifies for real, pass@k doubles as an honest success rate, not just an oracle's best case.

ReAct #

a loop pattern for agents: reason, then act, then read the result, and repeat. the model writes a short thought, emits a tool call, otis runs it and returns the output, and the cycle continues until the model declares a final answer. it is the loop every otis agent runs.

replica #

one complete copy of the model running on one gpu, with its own server and port. four cards, four replicas, four otters fishing in parallel. replicas are how otis turns more gpus into more attempts; see data parallelism and the gpus page.

rollout #

a single end-to-end attempt at a task by one agent, from the first thought to the final check. one experiment is many rollouts run in parallel. each gets its own sandbox and is graded independently by the verifier.

runpod #

a cloud provider that rents gpu machines by the minute, including multi-card boxes. otis targets runpod pods because the experiment needs real nvidia hardware that an ordinary laptop or a cpu web host cannot provide. you start one replica per card and point the runner at them.

sandbox #

a private temporary directory given to each rollout, seeded with the task's starting files. the agent's shell and python tools can only touch this folder, so parallel otters never overwrite each other's work and the verifier grades a clean, isolated result.

temperature #

the dial on how random the model's choices are. low temperature makes it pick the safe, likely token every time; high temperature lets it wander. otis samples rollouts at 0.8 so the attempts actually diverge — best-of-N is pointless if every try is identical.

tensor parallelism #

splitting a single model's weights across several gpus so a model too big for one card can still run, the cards cooperating on each token. it buys one smarter attempt rather than many parallel ones, which is why otis prefers data parallelism instead. comparing the two at a fixed gpu budget is the experiment otis is built to grow into.

test-time compute #

spending more computation at the moment you ask a question, rather than training a bigger model ahead of time. it comes in two flavours: thinking longer (one long chain of reasoning) and thinking wider (many parallel attempts). otis studies the wide kind, because it scales cleanly across replicas. this is the trend the whole project pokes at.

throughput & gpu-seconds #

throughput is how many tokens the gpus produce per second; gpu-seconds is cards multiplied by wall-clock, the honest unit of "how much compute did this cost." otis logs tokens and time per rollout so success can be plotted against real gpu-seconds, not just against the count of tries.

verifier (check) #

a shell command attached to a task that runs against a finished sandbox and exits zero only on success: run the tests, compare the output, check the file. it is what makes best-of-N honest — selection by real pass/fail rather than the model's own opinion of itself.

vllm #

a fast inference server for language models. otis runs one vllm process per gpu (a replica), each exposing an openai-style api on its own port. it handles batching, the kv cache, and serving the weights so the agent code can stay simple.