whisper.cpp decode failures often mean host starvation
Diagnosing intermittent inference failures from a local whisper.cpp HTTP server during a long batch transcription job.
Repeated 'whisper_full_with_state: failed to decode' followed by 'ggml_metal_free: deallocating' looks like a model or audio problem, but it most commonly indicates that the host is starved for CPU or memory: Metal allocations fail under pressure and decode aborts mid-stream. The downstream client sees only a generic 'fetch failed: other side closed', which obscures the real cause. Always check host load and RAM headroom on the GPU machine before suspecting the model, the audio, or the network tunnel.
When a whisper-server client reports socket-level failures, SSH into the inference host and run 'top' first. If the load average is wildly elevated or unused RAM is down to single-digit MB, fix that before touching the model or any tunnel layer.
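
The same check can be scripted so it runs before or during a batch job. The following is a minimal sketch, assuming the inference host is macOS (implied by the Metal log lines) with Python 3 available. The function names and both thresholds are illustrative assumptions, not part of whisper.cpp: load above twice the core count stands in for "wildly elevated", and 10 MB stands in for "single-digit MB". It also assumes that "Pages free" from vm_stat is a reasonable proxy for the unused figure that top reports.

```python
import os
import re
import subprocess


def parse_vm_stat(text):
    """Return free memory in MB from the text output of macOS `vm_stat`."""
    # Header line looks like: "... (page size of 16384 bytes)"
    page_size = int(re.search(r"page size of (\d+) bytes", text).group(1))
    # Counter line looks like: "Pages free:      3421."
    free_pages = int(re.search(r"Pages free:\s+(\d+)\.", text).group(1))
    return free_pages * page_size / (1024 * 1024)


def looks_starved(load_1min, cpu_count, free_mb,
                  load_factor=2.0, min_free_mb=10.0):
    """Heuristic only: both thresholds are assumptions chosen for illustration."""
    return load_1min > load_factor * cpu_count or free_mb < min_free_mb


def check_local_host():
    """Run on the inference host itself; returns (starved, load_1min, free_mb)."""
    out = subprocess.run(["vm_stat"], capture_output=True,
                         text=True, check=True).stdout
    free_mb = parse_vm_stat(out)
    load_1min = os.getloadavg()[0]
    cpu_count = os.cpu_count() or 1
    return looks_starved(load_1min, cpu_count, free_mb), load_1min, free_mb
```

Calling check_local_host() on the inference machine (for example from a small script invoked over SSH) gives a quick yes/no answer before any time is spent on the model, the audio, or the tunnel. The thresholds should be tuned to the machine; a host that normally runs near full memory may need a different floor.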