gpus
how otis uses the gpus
one otter per card, all of them fishing at once.
___________________________________
| .-----. .-----. .-----. |
| | o | | o | | o | |
| | /|\ | | /|\ | | /|\ | |
| | | | | | | | | | |
| '-----' '-----' '-----' |
| |
| o t i s :: g p u |
|___________________________________|
|||||||||||||||||||||||||||||||||||||
| | | | | | | |
one model, copied per card
otis does not split one giant model across the gpus. it puts a whole copy of a smaller model on each card. four gpus means four complete, independent otters, each able to attempt the task on its own. this is called data parallelism, or just "replicas."
in practice that is one vllm server per gpu, each pinned to a single card and listening on its own port:
  gpu 0       gpu 1       gpu 2       gpu 3
[=======]   [=======]   [=======]   [=======]
vllm:8001   vllm:8002   vllm:8003   vllm:8004
 (o w o)     (o w o)     (o w o)     (o w o)
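a minimal sketch of that layout, not otis's actual launch script (the source does not show one). it assumes the `vllm serve` command with its `--port` flag, and pins each process with the `CUDA_VISIBLE_DEVICES` environment variable. the model name and the helper names `replica_specs` and `launch` are hypothetical placeholders.

```python
import os
import subprocess

MODEL = "your-org/your-small-model"  # hypothetical placeholder, not otis's real model
BASE_PORT = 8001                     # same ports as in the diagram: 8001..8004


def replica_specs(num_gpus, model=MODEL, base_port=BASE_PORT):
    """Build one launch spec per gpu: the command plus the env that pins it."""
    specs = []
    for gpu in range(num_gpus):
        specs.append({
            "gpu": gpu,
            "port": base_port + gpu,
            # CUDA_VISIBLE_DEVICES hides every other card from this process.
            "env": {"CUDA_VISIBLE_DEVICES": str(gpu)},
            "cmd": ["vllm", "serve", model, "--port", str(base_port + gpu)],
        })
    return specs


def launch(specs):
    """Start each replica as its own process (needs vllm installed and real gpus)."""
    procs = []
    for spec in specs:
        env = {**os.environ, **spec["env"]}
        procs.append(subprocess.Popen(spec["cmd"], env=env))
    return procs


if __name__ == "__main__":
    # Dry run: print what would be started, without touching any gpu.
    for spec in replica_specs(4):
        print(spec["env"], " ".join(spec["cmd"]))
```

building the specs separately from launching them keeps the plan easy to inspect on a machine with no gpus at all.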
why replicas, not one big model
there are two honest ways to spend a fixed pile of gpus on one task. they pull in opposite directions:
| approach | what it buys | cost |
|---|---|---|
| tensor parallelism (one big model, sharded) | a single, smarter attempt from a bigger brain than fits on one card | still just one attempt; gpus wait on each other |
| data parallelism (small model, replicated) | many attempts at full speed, perfectly parallel | each attempt is from a weaker model |
otis is built for the second one, because parallel attempts are exactly what test-time compute scaling wants. but the comparison itself is interesting: at a fixed gpu budget, is one big model answering once better than a small model answering many times with a verifier? that is the experiment otis is meant to grow into.
more gpus, more shots
because the replicas are independent, the work scales the simplest way there is. one task goes out to every card at the same instant. each card runs its own full attempt. the runner collects them all and the verifier sorts winners from losers.
              one task
                 |
   .--------+----+----+--------.
   |        |         |        |
 gpu 0    gpu 1     gpu 2    gpu 3   ...
   |        |         |        |
  try      try       try      try
   |        |         |        |
   '--------+----+----+--------'
                 |
          keep what passes
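the fan-out can be sketched in a few lines. this is an illustration under assumptions, not otis's runner: `fan_out`, `attempt`, and `verify` are hypothetical names, and the real attempt would be a request to one replica's server rather than the stand-in used here.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(task, endpoints, attempt, verify):
    """Send one task to every replica at once and keep the attempts that pass.

    attempt(endpoint, task) -> answer   (hypothetical: one full try on one replica)
    verify(task, answer)    -> bool     (hypothetical: the verifier)
    """
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        # One thread per replica, so no card waits on another.
        answers = list(pool.map(lambda ep: attempt(ep, task), endpoints))
    return [a for a in answers if verify(task, a)]


if __name__ == "__main__":
    endpoints = [f"http://localhost:{port}" for port in range(8001, 8005)]

    # Stand-in for a real request: each replica "answers" differently.
    def fake_attempt(endpoint, task):
        return int(endpoint[-1]) + task

    # Stand-in verifier: accept even answers only.
    def fake_verify(task, answer):
        return answer % 2 == 0

    print(fan_out(10, endpoints, fake_attempt, fake_verify))  # [12, 14]
```

passing `attempt` and `verify` in as arguments keeps the fan-out logic testable without any gpu or server running.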
double the cards and you roughly double the shots per unit time, which raises the odds that one of them lands. that relationship — shots against success — is the curve the method measures.
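one idealized way to picture that curve: if every shot succeeds independently with the same probability p, the chance that at least one of n shots lands is 1 - (1 - p)^n. real attempts from the same model are correlated, so treat this as a rough upper-bound intuition rather than a prediction. the function name and the example value of p are made up for illustration.

```python
def p_at_least_one(p_single, shots):
    """Chance that at least one of `shots` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** shots


if __name__ == "__main__":
    p = 0.2  # hypothetical per-shot success rate
    for shots in (1, 2, 4, 8):
        print(shots, round(p_at_least_one(p, shots), 4))
```

the curve climbs quickly at first and then flattens, which is why measuring shots against success is more informative than assuming each extra card helps equally.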
where this runs
none of this happens on a laptop. it needs real nvidia cards with enough memory to hold the model plus its kv cache. otis targets multi-gpu runpod pods: spin up a box with two or four cards, start one replica per card, point the runner at them.
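a back-of-envelope check for "enough memory" can look like the sketch below. it uses the usual estimate for a transformer kv cache (one key and one value vector per layer, per kv head, per token) and ignores activations and framework overhead, so real usage will be higher. the model shape and card sizes are hypothetical examples, not otis's actual configuration.

```python
def weights_bytes(num_params, bytes_per_param=2):
    """Memory for the weights alone (2 bytes per parameter for fp16/bf16)."""
    return num_params * bytes_per_param


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Memory for cached keys and values across `tokens` tokens."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def fits(card_bytes, num_params, tokens, layers, kv_heads, head_dim):
    """True if weights plus kv cache fit on the card (a lower bound on need)."""
    need = weights_bytes(num_params) + kv_cache_bytes(tokens, layers, kv_heads, head_dim)
    return need <= card_bytes


if __name__ == "__main__":
    GIB = 2 ** 30
    # Hypothetical 8B-parameter model: 32 layers, 8 kv heads, head_dim 128.
    shape = dict(num_params=8_000_000_000, tokens=32_768,
                 layers=32, kv_heads=8, head_dim=128)
    for card_gib in (16, 24):
        print(card_gib, "GiB card:", fits(card_gib * GIB, **shape))
```

with these example numbers the weights take about 16 GB and the cache about 4 GiB for 32,768 tokens, so a 16 GiB card is too small and a 24 GiB card is not.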