I needed to bring up vLLM, a high-throughput inference server for LLMs, with the FlashInfer backend (a library of CUDA kernels for attention and sampling that speeds up inference) on a cluster of 8× H100s – in an environment with no root, no internet, and a pre-approval process for every package. A normal install falls apart at the very first step, and not because the task is algorithmically hard. It falls apart because the entire toolchain around CUDA silently assumes three things I didn’t have: superuser rights, network access, and a system package manager that resolves dependencies for you. In this article I work through how to rebuild that toolchain by hand – from unpacked RPM packages – and which traps lie along the way.

About the environment this happens in

A large company’s closed network: production servers with no sudo, no internet, an internal package repository, and mandatory pre-approval for anything that lands there. Almost every CUDA install tutorial breaks against this set of constraints, because tutorials are written for a machine where you’re root and have internet access. I had neither.

Why a plain pip install doesn’t work here

When you need to install FlashInfer, what do you usually do? Right – pip install flashinfer-python (that’s the PyPI package name; there are also separate flashinfer-cubin and flashinfer-jit-cache). In a normal environment this pulls a wheel built for your CUDA-and-GPU combination, or it builds the kernels on the spot against an installed CUDA toolchain. In a closed network both paths break immediately: a prebuilt wheel for the exact combination you need most likely isn’t in the internal repository, and you can’t download one from outside. That leaves building from source – which needs nvcc, CUDA headers, and devel libraries, i.e. a full toolchain.

The reason this is solvable without root at all: the CUDA toolchain and the CUDA driver are different things, and I only needed the first. The driver was already installed on the servers – that’s done by whoever has the rights. The toolchain (compiler, headers, static and shared libraries) is just files. You can drop them anywhere in your $HOME and point at them with environment variables. No root needed for that; you only need a way to extract those files from the packages without running the installer – which is exactly what wants to write into /usr/local and call ldconfig.

The incomplete package: why I had to assemble the toolchain myself

The first attempt was simple: grab cuda-toolkit-12.4.1 from the internal repository. It didn’t work – the package turned out to be incomplete. Its include tree was nothing but symlinks pointing only into cuda_cudart/include; whole components that modern CUDA kernels depend on simply weren’t there. I tried to patch in what was missing by hand – wiring up symlinks and copying headers from the neighboring components:

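# cp -s needs absolute source paths unless the destination is the current directory
mkdir -p include bin
cp -rsf "$PWD"/cuda_cccl/targets/x86_64-linux/include/* include/
cp -rsf "$PWD"/cuda_cudart/include/* include/
cp -rsf "$PWD"/cuda_nvcc/bin/* bin/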

That approach could work, but against an incomplete package it turns into an endless hunt for whatever’s missing next. So I stopped patching 12.4.1 and assembled a complete 12.6 toolchain from scratch – from separate RPM components. The exact patch version 12.6.3 doesn’t matter here: it’s just what I happened to unpack; what matters is that the components come from the same toolkit release / repository channel rather than being mixed arbitrarily across 12.x packages. Since CUDA 11 the toolkit components are versioned individually (mine were cuda-nvcc 12.6.85, cuda-cccl and cuda-cudart 12.6.77), so “same minor” is a useful rule of thumb, not a hard compatibility guarantee.

CUDA without root: rpm2cpio instead of the installer

An RPM is just an archive. You can pull its contents out without root – without even rpm itself in the system – with the rpm2cpio | cpio pair. CUDA in the repository is split by component – one package per component – so I unpacked exactly the ones needed for the build:

mkdir -p cuda-toolkit-12.6.3 && cd cuda-toolkit-12.6.3
rpm2cpio ~/libs/cuda-toolkit-12-6-12.6.3-1.x86_64.rpm       | cpio -idmv
rpm2cpio ~/libs/cuda-nvcc-12-6-12.6.85-1.x86_64.rpm         | cpio -idmv
rpm2cpio ~/libs/cuda-cccl-12-6-12.6.77-1.x86_64.rpm         | cpio -idmv
rpm2cpio ~/libs/cuda-cudart-devel-12-6-12.6.77-1.x86_64.rpm | cpio -idmv

cpio -idmv unpacks the payload into the current directory, preserving the internal path structure. After that the toolchain lands in .../usr/local/cuda-12.6, and that path becomes $CUDA_HOME:

export CUDA_HOME=$HOME/cuda-toolkit-12.6.3/usr/local/cuda-12.6
export PATH=$CUDA_HOME/bin:$PATH
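Before going further it is worth confirming that the shell really resolves nvcc to the unpacked tree and not to some other copy on the machine. A minimal sketch, assuming CUDA_HOME and PATH are set as above; check_nvcc is a hypothetical helper name, not something CUDA ships:

```shell
# check_nvcc: verify that nvcc exists in the unpacked toolchain and that
# PATH resolves to exactly that binary.
check_nvcc() {
    expected="$CUDA_HOME/bin/nvcc"
    if [ ! -x "$expected" ]; then
        echo "missing: $expected (was the cuda-nvcc package unpacked?)" >&2
        return 1
    fi
    resolved=$(command -v nvcc) || { echo "nvcc is not on PATH" >&2; return 1; }
    if [ "$resolved" != "$expected" ]; then
        echo "PATH resolves nvcc to $resolved, not $expected" >&2
        return 1
    fi
    echo "ok: $resolved"
}
```

Running check_nvcc right after the two exports catches the case where an older system nvcc shadows the one you just unpacked.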

Next – the header-location trap. After unpacking, some of the headers sit not in $CUDA_HOME/include but in targets/x86_64-linux/include/, and the compiler doesn’t look for them there. This happens because the CUDA toolchain is multi-target (it can cross-compile for x86_64, aarch64, and others), so the real store of headers is targets/<triplet>/include, and the familiar $CUDA_HOME/include is, on a normal install, just a symlink to it. That symlink is created by the RPM’s post-install scriptlet, and rpm2cpio | cpio only unpacks files – it doesn’t run scriptlets – so you have to assemble the include/ storefront by hand. The headers need to be merged into the one directory that’s expected:

mkdir -p "$CUDA_HOME/include"
cp -r "$CUDA_HOME"/targets/x86_64-linux/include/* "$CUDA_HOME/include/"

The mechanism worth taking away: splitting CUDA into RPM components means you assemble the dependency graph between them yourself. The installer held that graph and put everything in at once; unpacking by hand, you get exactly the files you named and not one more. A missing header isn’t a build bug – it’s a component you haven’t unpacked yet.
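If a missing header is really a missing component, you can ask that question before the compiler does. A small sketch under my own assumptions: missing_headers is a hypothetical helper, and the list of headers to probe depends on what your build needs.

```shell
# missing_headers INCLUDE_DIR HEADER...
# Prints every header that is absent under INCLUDE_DIR and returns 1 if any
# is missing. A dangling symlink counts as missing, because -e follows links.
missing_headers() {
    inc=$1
    shift
    rc=0
    for h in "$@"; do
        if [ ! -e "$inc/$h" ]; then
            echo "missing: $h"
            rc=1
        fi
    done
    return "$rc"
}

# Example: cuda_runtime.h comes from cudart, nv/target from CCCL.
# missing_headers "$CUDA_HOME/include" cuda_runtime.h nv/target
```

Each line of output points at a component still to be unpacked, which is cheaper to read than a failed nvcc invocation.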

The missing header: nv/target from cuda_cccl

The clearest manifestation of this is an error on one specific header:

nv/target: No such file or directory

nv/target lives in CCCL, the CUDA Core Compute Libraries: Thrust, CUB, and libcudacxx – the set of C++ libraries that modern CUDA kernels lean on. In the big .run installer it ships in the bundle, but in the RPM split it is a separate package, cuda-cccl. The compile error doesn’t name it directly: you only see the missing header, not the package that carries it. This is exactly why the incomplete 12.4.1 wouldn’t build – its truncated include tree had no CCCL headers.

The fix is the same as for the toolchain itself: unpack the cuda-cccl package and merge its include directory into $CUDA_HOME/include. After that FlashInfer’s kernels find CCCL – the build logs show nvcc picking them up via -isystem $CUDA_HOME/include/cccl.
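Since the compiler names the header and not the package, it helps to search the RPMs themselves. The sketch below lists each archive's contents without extracting anything. rpm_providing is a hypothetical helper name, and the example paths assume the ~/libs layout used earlier.

```shell
# rpm_providing PATTERN RPM...
# Prints each RPM whose payload contains a path matching PATTERN (a grep regex).
# cpio -it only prints the table of contents; no files are written.
rpm_providing() {
    pattern=$1
    shift
    for rpm in "$@"; do
        if rpm2cpio "$rpm" | cpio -it 2>/dev/null | grep -- "$pattern" >/dev/null; then
            echo "$rpm"
        fi
    done
}

# Example:
# rpm_providing 'include/nv/target$' ~/libs/cuda-*-12-6-*.rpm
```

This turns "which package am I missing" into a one-line query against whatever the internal repository has already approved.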

Why you don’t need to set sm_90a by hand

The H100 belongs to the Hopper generation, compute capability 9.0. I expected to have to set the target for FlashInfer by hand – the architecture-specific sm_90a. It’s not just an “accelerated” sm_90: the a suffix enables Hopper-only instructions (WGMMA, TMA-related paths) at the cost of portability – such code isn’t forward-compatible with future architectures. FlashInfer’s docs for building flashinfer-jit-cache from source do use FLASHINFER_CUDA_ARCH_LIST set to 9.0a.

As it turned out – not needed. My working config has just:

export TORCH_CUDA_ARCH_LIST="9.0"

FlashInfer sets 90a itself. In the build logs its JIT calls nvcc with -gencode=arch=compute_90a,code=sm_90a and caches the result under ~/.cache/flashinfer/<version>/90a/ – meaning the library picks the Hopper-specific sm_90a target for its own kernels, while TORCH_CUDA_ARCH_LIST governs building the rest (the torch and vLLM extensions).
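You can check which targets the JIT actually built without reading the build logs, just by looking at the cache layout. A minimal sketch, assuming the version/target directory layout I saw in my own cache; flashinfer_cached_targets is a hypothetical helper name.

```shell
# flashinfer_cached_targets [CACHE_ROOT]
# Prints "<version>/<target>" for every target directory in the JIT cache,
# assuming a <cache>/<version>/<target>/ layout.
flashinfer_cached_targets() {
    root=${1:-$HOME/.cache/flashinfer}
    for dir in "$root"/*/*/; do
        [ -d "$dir" ] || continue      # the glob matched nothing
        dir=${dir%/}
        version=${dir%/*}
        echo "${version##*/}/${dir##*/}"
    done
}
```

On my machine this would print a line ending in 90a; an empty result simply means nothing has been JIT-compiled yet.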

You genuinely need to set FlashInfer’s architecture by hand (FLASHINFER_CUDA_ARCH_LIST=9.0a) mainly in three cases:

  1. Building flashinfer-jit-cache from source, when you want to restrict the target list explicitly – exactly the scenario the docs set 9.0a for.
  2. Building in an environment with no visible GPU – a Docker build, CI, a build node without an H100/H200. Autodetection then can’t see Hopper, and you have to name it explicitly.
  3. When you need guaranteed Hopper-specific kernels rather than a fallback path. Some FlashInfer APIs have a separate Hopper backend: a GEMM wrapper with backend="auto", say, picks CUTLASS Hopper when it’s available, but you can pin it explicitly too ("sm90").
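The second case can be scripted: name the target only when no GPU is visible, and leave autodetection alone otherwise. This is a sketch, not FlashInfer's own logic. It assumes a driver recent enough for nvidia-smi to support the compute_cap query field, and set_flashinfer_arch is a hypothetical helper name.

```shell
# set_flashinfer_arch [FALLBACK]
# Exports FLASHINFER_CUDA_ARCH_LIST only if no GPU compute capability
# can be read; FALLBACK defaults to the Hopper target 9.0a.
set_flashinfer_arch() {
    fallback=${1:-9.0a}
    cap=""
    if command -v nvidia-smi >/dev/null 2>&1; then
        cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)
    fi
    # Accept only something that looks like "9.0"; discard error text.
    case $cap in
        [0-9]*.[0-9]*) ;;
        *) cap="" ;;
    esac
    if [ -n "$cap" ]; then
        echo "GPU visible (compute capability $cap): leaving autodetection alone"
    else
        export FLASHINFER_CUDA_ARCH_LIST="$fallback"
        echo "no GPU visible: FLASHINFER_CUDA_ARCH_LIST=$FLASHINFER_CUDA_ARCH_LIST"
    fi
}
```

On a build node without a GPU this pins 9.0a; on the H100 servers it changes nothing, which keeps the normal install as portable as before.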

My case – a normal install with the GPU present – is exactly the one where autodetection works and a manual target was redundant: it would only have made the build less portable. But if you do see an error like this:

no kernel image is available for execution on the device

that’s precisely the sign that the compiled kernels aren’t for the GPU’s compute capability.

The linker can’t find cudart

With the headers in place, the build reaches the link step and fails:

/usr/bin/ld: cannot find -lcudart
collect2: error: ld returned 1 exit status

Two causes converge here, both consequences of the manual layout. First: build-time linking doesn’t consult the ldconfig cache or LD_LIBRARY_PATH at all – those belong to the runtime loader. Resolving -lcudart is a build-time search the GCC/G++ driver and the linker do together, across explicit -L flags, the directories from LIBRARY_PATH (read by the GCC driver, not by ld itself), and GCC’s configured library paths. Point none of them at your unpacked 12.6 and the link simply never sees it. Second, subtler: the runtime cuda-cudart package ships only the versioned libcudart.so.12, with no unversioned libcudart.so symlink, and the -lcudart flag needs exactly the unversioned name to resolve. The unversioned .so comes in the devel package – which is why I unpacked cuda-cudart-devel above, not just cuda-cudart.

The fix is a combination, not a single line:

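# ${VAR:+:$VAR} appends the old value only when it is non-empty, so no stray trailing ':' is left behind
export LIBRARY_PATH=$CUDA_HOME/targets/x86_64-linux/lib:$CUDA_HOME/lib64${LIBRARY_PATH:+:$LIBRARY_PATH}
export LD_LIBRARY_PATH=$CUDA_HOME/targets/x86_64-linux/lib:$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export LDFLAGS="-L$CUDA_HOME/targets/x86_64-linux/lib -L$CUDA_HOME/lib64"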

And here’s the detail FlashInfer’s docs don’t call out separately (GCC and binutils document it, of course – but who reads those while debugging an inference engine’s install): LD_LIBRARY_PATH isn’t enough here. That’s the runtime loader’s path – it affects how a library is found when an already-built binary runs. Linking happens at build time, and it needs LIBRARY_PATH and -L. Mixing the two up is easy: you set LD_LIBRARY_PATH, check ldconfig -p | grep libcudart – and a libcudart seems to be there. But ldconfig -p only lists the system-wide loader cache: it doesn’t read LD_LIBRARY_PATH at all, so a hit there is some system copy the runtime loader knows about, and it says nothing about whether the linker will find the right name for -lcudart. What the linker looks for is specifically the unversioned libcudart.so – if all you have is the versioned libcudart.so.12, linking fails even with the path set correctly, tripping over the absence of the unversioned .so, not over the path.
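A more honest check than ldconfig is to look for the linker name directly in the directories the build will search. A minimal sketch: linker_name_dirs is a hypothetical helper, and it only inspects a colon-separated list such as LIBRARY_PATH, not GCC's built-in search directories.

```shell
# linker_name_dirs NAME [PATHLIST]
# Prints every directory in a colon-separated list (default: $LIBRARY_PATH)
# that holds a file -lNAME can resolve to: libNAME.so or libNAME.a.
# A versioned libNAME.so.12 on its own deliberately does not count.
# The body runs in a subshell so the IFS change does not leak out.
linker_name_dirs() (
    name=$1
    list=${2-${LIBRARY_PATH-}}
    rc=1
    IFS=:
    for dir in $list; do
        [ -n "$dir" ] || continue
        if [ -e "$dir/lib$name.so" ] || [ -e "$dir/lib$name.a" ]; then
            echo "$dir"
            rc=0
        fi
    done
    exit "$rc"
)

# Example:
# linker_name_dirs cudart || echo "no libcudart.so: unpack cuda-cudart-devel"
```

If this prints nothing while the versioned library is clearly present, the problem is the missing devel package, not the path.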

FlashInfer, by the way, gives you a proper knob for this: its JIT build passes its own -L.../lib64, -L.../lib64/stubs, -lcudart, -lcuda into ldflags, and you can extend that list with the FLASHINFER_EXTRA_LDFLAGS variable without touching the global LDFLAGS. In my case the environment above was enough, but if your LIBRARY_PATH edits don’t reach the JIT compilation – this is the more targeted way.
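For completeness, this is roughly what the targeted variant looks like. I am assuming the variable takes extra linker flags as a single space-separated string, and the default path is simply the one from my layout above.

```shell
# Assumption: FLASHINFER_EXTRA_LDFLAGS holds space-separated linker flags.
# The fallback CUDA_HOME is the unpack location used in this article.
CUDA_HOME=${CUDA_HOME:-$HOME/cuda-toolkit-12.6.3/usr/local/cuda-12.6}
export FLASHINFER_EXTRA_LDFLAGS="-L$CUDA_HOME/targets/x86_64-linux/lib"
```

The point of this route is scope: it affects only FlashInfer's JIT link step, while every other build on the machine keeps its own LDFLAGS.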

The stale JIT cache

A separate problem you’ll wrestle with if you don’t know about it. FlashInfer compiles some kernels just-in-time (on first run rather than at install time) and caches the result in ~/.cache/flashinfer, laying it out by its own version and target (~/.cache/flashinfer/0.6.4/90a/). And here’s the key: in my FlashInfer version and build path, changing the underlying CUDA toolkit layout didn’t reliably invalidate the already-compiled JIT artifacts. The cache key may well account for the FlashInfer version, source hashes, flags, and the CUDA architecture – but lower-level changes slip past it: CUDA_HOME, the library layout, a manually repaired toolchain. So when you change something underneath – rebuild CUDA, switch versions, fix paths – the cache is still treated as valid: FlashInfer pulls kernels built for the previous, broken configuration out of it, and your fix never reaches recompilation.

Why that’s bad: you don’t get a new, telling error – you hit the exact same one you were fixing. There’s no separate “stale-cache failure”; there’s the original fault reproduced one-to-one, because what runs is the old artifacts, not your fresh build. That’s the whole trap: you change the right thing, the symptom doesn’t move – and you start suspecting your fix rather than the cache. The cure is to wipe the caches after any CUDA change, and not just flashinfer’s:

rm -rf ~/.cache/flashinfer ~/.cache/vllm ~/.cache/torch_extensions

The generalization is about the whole class of bugs: any JIT cache is hidden state between your configuration and what actually executes. The rule is simple: change the lower layer – invalidate the cache above it, or you’re debugging something other than what runs. This was the single most time-expensive episode of all.
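The rule can be partly automated by keeping a fingerprint of the lower layer next to the caches. This is only a sketch: the stamp file and the function name are hypothetical, and the fingerprint covers just the toolchain location and the nvcc version banner, so a repair inside the same tree (a copied header, a new symlink) still needs the manual rm above.

```shell
# refresh_jit_caches: wipe the JIT caches when CUDA_HOME or the nvcc version
# differs from what was recorded last time. The first run always wipes.
refresh_jit_caches() {
    stamp="$HOME/.cache/cuda-toolchain.stamp"
    current=$(printf '%s\n' "$CUDA_HOME"; "$CUDA_HOME/bin/nvcc" --version 2>/dev/null || true)
    previous=""
    if [ -f "$stamp" ]; then
        previous=$(cat "$stamp")
    fi
    if [ "$current" != "$previous" ]; then
        rm -rf "$HOME/.cache/flashinfer" "$HOME/.cache/vllm" "$HOME/.cache/torch_extensions"
        mkdir -p "$HOME/.cache"
        printf '%s\n' "$current" > "$stamp"
        echo "toolchain changed: JIT caches wiped"
    else
        echo "toolchain unchanged: JIT caches kept"
    fi
}
```

Calling it from the same script that sets CUDA_HOME means a toolchain switch can no longer silently reuse kernels from the previous one.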

Python 3.11 for this pinned stack, not 3.12

The last constraint – not about CUDA, but about the same manual environment build. The stack pulls in mistral-common, and under Python 3.12 it wouldn’t come up in this particular combination – so I kept the venv on Python 3.11 (cpython-3.11.15, visible in the paths from the logs). This wasn’t a general mistral-common limitation – it formally supports 3.12; it was a compatibility issue in this specific pinned vLLM / FlashInfer / Mistral stack. And since there’s no package index at hand, I installed mistral-common not from the network but built it locally from a tarball of the needed version:

tar -xf ~/libs/v1.9.1.tar.gz
pip install ./mistral-common-1.9.1/

A local but important takeaway: in a closed network the Python version is also a dependency you pin explicitly, like everything else. You don’t control the system one (no root), there’s no index, so a venv on a specific version plus building packages from tarballs isn’t hygiene – it’s the only way.
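Pinning the interpreter is easier to keep honest with a guard that runs before any install step. A minimal sketch; require_python is a hypothetical helper, and the interpreter path you pass would be the one inside your own venv.

```shell
# require_python WANTED [INTERPRETER]
# Succeeds only if INTERPRETER (default: python3) reports exactly the
# WANTED major.minor version, e.g. 3.11.
require_python() {
    wanted=$1
    py=${2:-python3}
    actual=$("$py" -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null) || actual="none"
    if [ "$actual" != "$wanted" ]; then
        echo "need Python $wanted, found $actual ($py)" >&2
        return 1
    fi
    echo "ok: Python $actual"
}

# Example:
# require_python 3.11 ./venv/bin/python && ./venv/bin/pip install ./mistral-common-1.9.1/
```

The guard fails loudly when someone recreates the venv from a different system Python, instead of failing later inside an unrelated import.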

What follows from this

Building LLM inference in an air-gapped environment isn’t algorithmically hard. The difficulty is that every convenient tool around CUDA silently assumes root, a network, and a package manager – and all three assumptions fall away at once. The installer that held the dependency graph, created the expected symlinks, and wired up the linker paths is gone – and you do all of its work by hand: unpacking components one at a time, reconciling their dependencies, splitting the link-time and run-time library paths across different variables, and cleaning up the cache behind you.

An unexpected upside – this manual build turned out not to be one-off shamanism but a reproducible configuration. That became clear later, and undramatically: another user launched the same model on the same server without repeating any of the above. All it took was setting the right paths in their environment, and it worked. That’s the proof the toolchain is assembled correctly: if the result comes down to a set of variables with paths, then you didn’t “conjure up some magic” – you actually laid all the files out in their places.
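In practice "setting the right paths" can be one sourceable file. The sketch below collects the variables from the earlier sections into a single function. The function name is hypothetical, and the argument is wherever the unpacked toolchain lives for you.

```shell
# use_cuda_toolchain ROOT
# Exports the variables set one by one earlier in this article,
# pointing all of them at the toolchain unpacked under ROOT.
use_cuda_toolchain() {
    CUDA_HOME=$1
    libdirs="$CUDA_HOME/targets/x86_64-linux/lib:$CUDA_HOME/lib64"
    PATH="$CUDA_HOME/bin:$PATH"
    # ${VAR:+:$VAR} appends the old value only if it is non-empty; a trailing
    # ':' would add an empty entry, which the loader reads as the current directory.
    LIBRARY_PATH="$libdirs${LIBRARY_PATH:+:$LIBRARY_PATH}"
    LD_LIBRARY_PATH="$libdirs${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    LDFLAGS="-L$CUDA_HOME/targets/x86_64-linux/lib -L$CUDA_HOME/lib64"
    TORCH_CUDA_ARCH_LIST="9.0"
    export CUDA_HOME PATH LIBRARY_PATH LD_LIBRARY_PATH LDFLAGS TORCH_CUDA_ARCH_LIST
}

# Example:
# use_cuda_toolchain "$HOME/cuda-toolkit-12.6.3/usr/local/cuda-12.6"
```

A second user then needs read access to the unpacked tree and one function call, which is about as far from shamanism as an install can get.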

The skill you actually build here is seeing the toolchain as a set of files and paths, not as a black box with an “install” button. When you’re the installer, a missing header stops being a bug and becomes simply a package you haven’t unpacked yet.