A retrospective on bringing up vLLM with FlashInfer on 8×H100 in an air-gapped environment – no root, no internet, every package pre-approved. Why a plain pip install doesn't work here, how to assemble a complete CUDA toolchain by hand from separate RPM components, and which traps – an incomplete repo package, a missing header, a linker that can't find cudart, and a stale JIT cache – stand between you and working inference. Part one of a series on LLM inference in an air-gapped environment.
Evgeny Osipchuk.
Lead AI engineer building internal agent systems. I study what breaks when language models are put into real workflows – and what those failures reveal.
A retrospective on our LLM mentor for sales managers – a 2023–2024 project. Arriving at the prompt through a series of failures, designing a pilot without a 'Big Brother' effect, an A/B test that went the wrong way, and a three-step prompt chain in place of a single prompt. This is the first article in the series – an overview: the journey, the failures, and the five takeaways I've carried into every LLM product since.
To train managers you need a virtual client who resists and says no. This is exactly where an off-the-shelf LLM fails hardest – RLHF made it agreeable. I unpack how an uncooperative interlocutor is assembled out of a friendly assistant: role layers, different temperatures for the client and the validator, a ban on cheap excuses, the director's note 'you don't know this' – and why the client's reply gets no quality judge, only a check for whether the conversation is over.