
Small models, big context: a practical note

An honest writeup of running a 7B model with a 200k context window in production for a month. What broke, what surprised me, what I'd skip next time.


The setup

One box. One GPU. A queue. A retry policy.
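The retry policy, roughly. A sketch, not the production code; the attempt count, backoff values, and function names are illustrative:

```python
import random
import time

def with_retries(fn, max_attempts=4, base_delay=0.5):
    """Call fn, retrying on any exception with exponential backoff.

    max_attempts and base_delay are illustrative defaults,
    not the values we actually ran with.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # full jitter: sleep anywhere in [0, base * 2^attempt]
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

The jitter matters more than the base delay: without it, a burst of failures re-queues as a synchronized burst of retries.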

The boring problems

Memory fragmentation. Tokenizer mismatches. Logs.
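One cheap guard against tokenizer mismatch: never trust an exact token count near the window limit. A sketch; the 2% margin and the names are mine, not production values:

```python
def fits_window(prompt_tokens, completion_reserve,
                window=200_000, margin=0.02):
    """Reject requests whose counted prompt tokens, plus reserved
    completion tokens, come too close to the context window.

    The 2% margin is an illustrative buffer for the case where the
    client-side token count and the server tokenizer disagree.
    """
    budget = int(window * (1 - margin)) - completion_reserve
    return prompt_tokens <= budget
```

A request rejected here fails fast and cheap; the same mismatch discovered server-side fails after the full prompt has been shipped and prefilled.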

The interesting problems

The model forgot things in the middle of the context, not at the edges. "Lost in the middle" is a documented failure mode; we thought we knew it, but we did not.
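You can measure this with a needle-at-depth probe: splice one fact into filler text at evenly spaced depths and check recall at each. A sketch; `model` is a hypothetical callable (prompt in, answer out), not our actual harness:

```python
def needle_positions(doc_len, n_probes=9):
    """Depths at which to insert the probe fact: evenly spaced
    from the start of the filler to the end."""
    return [int(i * (doc_len - 1) / (n_probes - 1)) for i in range(n_probes)]

def run_probe(model, filler, needle, question, positions):
    """For each depth, splice the needle into the filler tokens and
    ask the model to recall it. Returns {depth: recalled?}."""
    recall = {}
    for pos in positions:
        prompt = filler[:pos] + [needle] + filler[pos:]
        answer = model(" ".join(prompt) + "\n" + question)
        recall[pos] = needle in answer
    return recall
```

Plot recall against depth and the middle sag is hard to miss; averaging over depths is exactly what hides it.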

Verdict

For the right workload, it pays for itself in a week. For the wrong one, you’ll wish you had picked a bigger model.


thanks for reading. the rest of the archive is here.