Small models, big context: a practical note
An honest writeup of running a 7B model with a 200k context window in production for a month. What broke, what surprised me, what I'd skip next time.
The setup
One box. One GPU. A queue. A retry policy.
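A minimal sketch of that shape, assuming an in-process queue and a bounded linear-backoff retry policy; the retry count, backoff, and `run_inference` stand-in are illustrative, not from the post.

```python
import queue
import time

MAX_RETRIES = 3        # assumed value, not from the post
BACKOFF_SECONDS = 0.1  # assumed value, not from the post

def run_inference(prompt):
    # Stand-in for the actual model call on the single GPU.
    return f"completion for: {prompt}"

def drain(jobs, infer=run_inference):
    """Process every queued (prompt, attempts) pair with a bounded retry policy."""
    results = []
    while not jobs.empty():
        prompt, attempts = jobs.get()
        try:
            results.append(infer(prompt))
        except Exception:
            if attempts + 1 < MAX_RETRIES:
                time.sleep(BACKOFF_SECONDS * (attempts + 1))  # linear backoff
                jobs.put((prompt, attempts + 1))
            # else: give up after MAX_RETRIES attempts

    return results

jobs = queue.Queue()
jobs.put(("summarize this log", 0))
print(drain(jobs))
```

Nothing exotic: a single worker draining a single queue is easy to reason about, and the retry cap keeps a poisoned prompt from looping forever.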
The boring problems
Memory fragmentation. Tokenizer mismatches. Logs.
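The tokenizer-mismatch class of bugs can be caught at startup rather than mid-request. A hypothetical sanity check, assuming you can read the tokenizer's vocab size and the model's embedding row count; the function name and numbers are illustrative.

```python
def check_tokenizer_match(tokenizer_vocab_size, model_embedding_rows):
    """Fail fast if token ids could index past the model's embedding table."""
    if tokenizer_vocab_size > model_embedding_rows:
        raise ValueError(
            f"tokenizer vocab ({tokenizer_vocab_size}) exceeds "
            f"model embeddings ({model_embedding_rows})"
        )
    return True

# A matched pair passes; a mismatched pair raises before serving traffic.
print(check_tokenizer_match(32000, 32000))  # → True
```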
The interesting problems
The model forgot things in the middle of the context, not at the edges. We thought we knew this; living with it in production is different.
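One way to measure the effect is a needle-in-a-haystack probe: plant a fact at varying depths of a long filler context and check recall per depth. This is a sketch of the harness only; `ask_model` is a hypothetical stand-in for the real inference call, and the filler text and needle are made up.

```python
def build_context(needle, depth, filler_sentences=1000):
    """Insert `needle` at fractional `depth` (0.0 = start, 1.0 = end) of filler text."""
    filler = ["The sky was a flat grey that morning."] * filler_sentences
    pos = int(depth * len(filler))
    return " ".join(filler[:pos] + [needle] + filler[pos:])

def ask_model(context, question):
    # Stand-in: a real run would send context + question to the model
    # and grade its answer. Here we just string-search so the harness runs.
    return "PIN-7741" if "PIN-7741" in context else "unknown"

needle = "The locker code is PIN-7741."
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    ctx = build_context(needle, depth)
    answer = ask_model(ctx, "What is the locker code?")
    print(f"depth={depth:.2f} -> {answer}")
```

With a real model behind `ask_model`, a recall-versus-depth plot makes the mid-context dip visible instead of anecdotal.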
Verdict
For the right workload, it pays for itself in a week. For the wrong one, you’ll wish you had picked a bigger model.
∎ thanks for reading. the rest of the archive is here.