gpus

how otis uses the gpus

one otter per card, all of them fishing at once.


    ___________________________________
   |  .-----.    .-----.    .-----.    |
   | |   o   |  |   o   |  |   o   |   |
   | |  /|\  |  |  /|\  |  |  /|\  |   |
   | |   |   |  |   |   |  |   |   |   |
   |  '-----'    '-----'    '-----'    |
   |                                   |
   |   o t i s   ::   g p u            |
   |___________________________________|
   |||||||||||||||||||||||||||||||||||||
        |   |   |   |   |   |   |   |
  

one model, copied per card

otis does not split one giant model across the gpus. it puts a whole copy of a smaller model on each card. four gpus means four complete, independent otters, each able to attempt the task on its own. this is called data parallelism, or just "replicas."

in practice that is one vllm server per gpu, each pinned to a single card and listening on its own port:

   gpu 0          gpu 1          gpu 2          gpu 3
  [=======]      [=======]      [=======]      [=======]
   vllm:8001      vllm:8002      vllm:8003      vllm:8004
   (o w o)        (o w o)        (o w o)        (o w o)
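
as a rough illustration of the layout above, here is a minimal launcher sketch. it assumes the `vllm` command line tool is installed and uses its `vllm serve <model> --port <n>` form, with `CUDA_VISIBLE_DEVICES` doing the pinning. the model name, the helper names, and the starting port are placeholders chosen for this example, not otis's actual configuration.

```python
import os
import subprocess

# hypothetical placeholder, not an otis setting
MODEL = "your-org/your-model"
BASE_PORT = 8001  # matches the ports in the diagram above


def replica_command(model: str, port: int) -> list[str]:
    """Command line for one vLLM OpenAI-compatible server."""
    return ["vllm", "serve", model, "--port", str(port)]


def replica_env(gpu_index: int) -> dict[str, str]:
    """Copy of the current environment with only one GPU visible."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch_replicas(model: str, num_gpus: int, base_port: int = BASE_PORT) -> list[subprocess.Popen]:
    """Start one server process per GPU: gpu i listens on base_port + i."""
    return [
        subprocess.Popen(replica_command(model, base_port + i), env=replica_env(i))
        for i in range(num_gpus)
    ]
```

calling `launch_replicas(MODEL, num_gpus=4)` would start four processes, on ports 8001 through 8004. because each process sees exactly one card, no replica can spill onto a neighbour's memory.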
  

why replicas, not one big model

there are two honest ways to spend a fixed pile of gpus on one task. they pull in opposite directions:

  approach                   what it buys                        cost
  -------------------------  ----------------------------------  -----------------------
  tensor parallelism         a single, smarter attempt: a        still just one attempt;
  (one big model, sharded)   bigger brain than fits on one card  gpus wait on each other

  data parallelism           many attempts at full speed,        each attempt is from a
  (small model, replicated)  perfectly parallel                  weaker model

otis is built for the second one, because parallel attempts are exactly what test-time compute scaling wants. but the comparison itself is interesting: at a fixed gpu budget, is one big model answering once better than a small model answering many times with a verifier? that is the experiment otis is meant to grow into.

more gpus, more shots

because the replicas are independent, the work scales the simplest way there is. one task goes out to every card at the same instant. each card runs its own full attempt. the runner collects them all and the verifier sorts winners from losers.

                     one task
                        |
        .------+--------+--------+------.
        |      |        |        |      |
      gpu 0  gpu 1    gpu 2    gpu 3   ...
        |      |        |        |
       try    try      try      try
        |      |        |        |
        '------+----+---+--------'
                    |
             keep what passes
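
the fan-out in the diagram can be sketched in a few lines. this is only an illustration of the shape, not otis's runner: `fan_out`, `fake_attempt`, and `fake_verify` are hypothetical names, and the stand-in functions replace the real http call and the real verifier so the sketch runs without any gpus.

```python
from concurrent.futures import ThreadPoolExecutor

# hypothetical replica endpoints, one per card
REPLICAS = [f"http://localhost:{port}" for port in (8001, 8002, 8003, 8004)]


def fan_out(task, replicas, attempt, verify):
    """Send one task to every replica at once and keep the attempts that pass."""
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        # one thread per replica, so all attempts are in flight together
        answers = list(pool.map(lambda url: attempt(url, task), replicas))
    return [answer for answer in answers if verify(task, answer)]


# stand-ins: a real attempt would call the replica's api,
# and a real verifier would run tests or checks on the answer
def fake_attempt(url: str, task: str) -> str:
    port = int(url.rsplit(":", 1)[1])
    return "right" if port % 2 == 0 else "wrong"


def fake_verify(task: str, answer: str) -> bool:
    return answer == "right"


passing = fan_out("some task", REPLICAS, fake_attempt, fake_verify)
print(len(passing), "of", len(REPLICAS), "attempts passed")
```

threads are enough here because each attempt spends its time waiting on a server, not computing locally.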
  

double the cards and you roughly double the shots per unit time, which raises the odds that at least one of them lands. that relationship, shots against success, is the curve the method measures.
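
to see why more shots help, here is a small worked sketch. it assumes every attempt succeeds independently with the same probability, which is a simplification: real attempts share a model and a prompt, so a measured curve can look different. the 0.2 per-shot success rate is an illustrative number, not an otis result.

```python
def p_any_success(p_single: float, shots: int) -> float:
    """Chance that at least one of `shots` independent attempts passes."""
    # all shots fail with probability (1 - p)^shots; anything else is a win
    return 1.0 - (1.0 - p_single) ** shots


for shots in (1, 2, 4, 8):
    print(shots, round(p_any_success(0.2, shots), 3))
```

under these assumptions the gain from each doubling shrinks as the curve approaches 1, which is part of why the shape of the curve is worth measuring rather than guessing.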

where this runs

none of this happens on a laptop. it needs real nvidia cards with enough memory to hold the model plus its kv cache. otis targets multi-gpu runpod pods: spin up a box with two or four cards, start one replica per card, point the runner at them.
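
a back-of-envelope sketch of "model plus its kv cache" follows. it uses the standard estimate of two vectors (key and value) per layer per kv head per token, and it ignores activations and framework overhead, so treat the result as an upper bound. the model shape and the 24 gb card below are made-up example numbers, not a model or card otis is known to use.

```python
def weight_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone (2 bytes per parameter for fp16/bf16)."""
    return num_params * bytes_per_param


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV cache cost of one token: a key and a value per layer per kv head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def tokens_that_fit(card_bytes: float, num_params: float, num_layers: int,
                    num_kv_heads: int, head_dim: int) -> int:
    """Cached tokens that fit in whatever the weights leave free."""
    free = card_bytes - weight_bytes(num_params)
    return max(0, int(free // kv_bytes_per_token(num_layers, num_kv_heads, head_dim)))


# illustrative shape: 8e9 parameters, 32 layers, 8 kv heads, head dim 128
print(tokens_that_fit(24e9, 8e9, 32, 8, 128))
```

if the number comes out at zero, the model does not fit on the card at all and a smaller model or a larger card is needed.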