Small models, big context: a practical note
An honest writeup of running a 7B model with a 200k context window in production for a month. What broke, what surprised me, what I'd skip next time.
The setup
One box. One GPU. A queue. A retry policy.
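A minimal sketch of that shape, assuming an in-process queue and a bounded linear-backoff retry policy; the retry count, backoff, and `run_inference` stand-in are illustrative, not from the post.

```python
import queue
import time

MAX_RETRIES = 3        # assumed value, not from the post
BACKOFF_SECONDS = 0.1  # assumed value, not from the post

def run_inference(prompt):
    # Stand-in for the actual model call on the single GPU.
    return f"completion for: {prompt}"

def drain(jobs, infer=run_inference):
    """Process every queued (prompt, attempts) pair with a bounded retry policy."""
    results = []
    while not jobs.empty():
        prompt, attempts = jobs.get()
        try:
            results.append(infer(prompt))
        except Exception:
            if attempts + 1 < MAX_RETRIES:
                time.sleep(BACKOFF_SECONDS * (attempts + 1))  # linear backoff
                jobs.put((prompt, attempts + 1))
            # else: give up after MAX_RETRIES attempts

    return results

jobs = queue.Queue()
jobs.put(("summarize this log", 0))
print(drain(jobs))
```

Nothing exotic: a single worker draining a single queue is easy to reason about, and the retry cap keeps a poisoned prompt from looping forever.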
The boring problems
Memory fragmentation. Tokenizer mismatches. Logs.
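The tokenizer-mismatch class of bugs can be caught at startup rather than mid-request. A hypothetical sanity check, assuming you can read the tokenizer's vocab size and the model's embedding row count; the function name and numbers are illustrative.

```python
def check_tokenizer_match(tokenizer_vocab_size, model_embedding_rows):
    """Fail fast if token ids could index past the model's embedding table."""
    if tokenizer_vocab_size > model_embedding_rows:
        raise ValueError(
            f"tokenizer vocab ({tokenizer_vocab_size}) exceeds "
            f"model embeddings ({model_embedding_rows})"
        )
    return True

# A matched pair passes; a mismatched pair raises before serving traffic.
print(check_tokenizer_match(32000, 32000))  # → True
```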
The interesting problems
The model forgot things in the middle of the context, not at the edges. We thought we knew this; living with it in production is different.
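One way to measure the effect is a needle-in-a-haystack probe: plant a fact at varying depths of a long filler context and check recall per depth. This is a sketch of the harness only; `ask_model` is a hypothetical stand-in for the real inference call, and the filler text and needle are made up.

```python
def build_context(needle, depth, filler_sentences=1000):
    """Insert `needle` at fractional `depth` (0.0 = start, 1.0 = end) of filler text."""
    filler = ["The sky was a flat grey that morning."] * filler_sentences
    pos = int(depth * len(filler))
    return " ".join(filler[:pos] + [needle] + filler[pos:])

def ask_model(context, question):
    # Stand-in: a real run would send context + question to the model
    # and grade its answer. Here we just string-search so the harness runs.
    return "PIN-7741" if "PIN-7741" in context else "unknown"

needle = "The locker code is PIN-7741."
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    ctx = build_context(needle, depth)
    answer = ask_model(ctx, "What is the locker code?")
    print(f"depth={depth:.2f} -> {answer}")
```

With a real model behind `ask_model`, a recall-versus-depth plot makes the mid-context dip visible instead of anecdotal.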
Verdict
For the right workload, it pays for itself in a week. For the wrong one, you’ll wish you had picked a bigger model.
∎ thanks for reading. the rest of the archive is here.