Taming Your Offline HPC Environment

My compute cluster at work, like most clusters, has air-gapped compute nodes (no internet access) and Internet-connected login nodes. A shared GPFS filesystem is mounted on every node. The cluster is managed by Kubernetes (k8s), a distributed container orchestrator. Containers are treated as disposable: every file is lost when a container stops, except for files on the shared storage.
This setup, while understandable for security and scalability reasons, is far less convenient than the usual ML development workflow on a single workstation. In this article I share the tips and tricks I use to create a streamlined, “local-like” experience.
Part 1: Portable and Persistent Workspace
Containers are designed to be minimal. Like the container housing you see on a construction site, a Docker container ships with only the programs needed to keep the system running, at minimal resource cost. Containers are also disposable, so changes made inside a container are lost permanently when it exits.
For day-to-day use, you need some furnishing to make your container feel like home: time-tested utility tools and tweaks to your shell configuration files. In other words, you need to build your own container image (the template from which new containers are created).
The standard way to build a container image is with a Dockerfile. Unfortunately, my cluster management platform does not support adding images via a Dockerfile; instead, I have to start from an existing container, make changes interactively, and save the current state as a new image. Therefore, I won’t provide a ready-to-use Dockerfile here and will only document the key steps.
Install necessary tools
For Ubuntu-based containers, running unminimize brings back most of the tools and features removed from the minimized Docker environment. Afterwards you can install your favourite software packages. Mine are:
```bash
apt-get update
```
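For illustration, a typical set of quality-of-life tools could then be pulled in like this (the package names are examples, not a prescribed list):

```bash
# Illustrative package list: swap in the utilities you actually use
apt-get install -y git vim tmux htop curl wget zsh ripgrep
```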
Link configuration files
My cluster has shared storage mounted at /gpfs, and the container’s home directory /root is not shared. Some other clusters share the home directory itself, using the HOME variable and proper ownership mapping; if that is your case, this and quite a few other sections of this post can be skipped.
Configuration files, such as .bashrc, are tweaked every time you want to add a new alias, environment variable or shell option. So instead of baking them into the container image, it’s better to let them reside on the shared storage; this way you don’t have to rebuild the image for every small change.
But how does the shell find these configuration files? Symbolic links solve this problem: they are shortcuts pointing to the actual file location, yet behave like the real file when read or written. To create one, run
```bash
ln -s /gpfs/config/.bashrc ~/.bashrc
```
If there is more than one configuration file for different tools, you can write a script that links everything under /gpfs/config (a sketch of such a script is included in the appendix).
And that’s it: log into any new container, and it instantly feels like your own terminal.
Part 2: Magic of Caches: Local-like Experience Across Nodes
On distributed clusters, the shared storage system often becomes the only place to pass artifacts between nodes. This adds a lot of inconvenience to your workflow.
Take Huggingface models as an example. On a local workstation one can run vllm serve some/model and the model is downloaded automatically on demand. On a cluster, however, the overwhelmingly common practice is to first run hf download some/model --local-dir /gpfs/your/project/some_model on the login node, then switch to the compute node and run vllm serve /gpfs/your/project/some_model, referring to the model by its location in the filesystem.
Not only are those long file paths inconvenient, they also hurt the portability of your codebase. Hardcoded file paths in scripts and code mean your code cannot run on another cluster without modification. Moreover, placing large assets like models inside the project directory discourages sharing them between projects and cleaning up unused assets, wasting storage space.
Considering convenience, portability and reusability, I’ll introduce three better practices.
Huggingface
In your .bashrc, set these environment variables:
```bash
shared_home=/gpfs # path to your shared storage
```
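The essential idea is to point the Hugging Face cache at that shared path. A minimal sketch, assuming the standard HF_HOME variable (the exact paths are examples):

```bash
# Keep the Hugging Face cache on shared storage, so models downloaded on the
# login node are visible by name on every compute node.
export HF_HOME=$shared_home/cache/huggingface
# Optionally make air-gapped nodes fail fast instead of probing the Hub.
export HF_HUB_OFFLINE=1
```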
Then, to use a model, simply run hf download some/model on the login node, without specifying where to store it. On the compute node, refer to the model by its name, just as you would on a local machine.
This trick works for vllm, verl, and virtually everything in the huggingface transformers ecosystem. In the rare circumstances where a folder path is expected, run hf scan-cache to reveal where the model is stored.
uv
uv is a modern Python package and project manager. I like it, and you should prefer uv over conda too, because uv is:
- faster: uv downloads packages concurrently and resolves dependencies more quickly because it is written in Rust.
- more portable: uv uses the standardized pyproject.toml (PEP 621) and lock files for reproducible dependencies. Say goodbye to the notorious requirements.txt and the proprietary environment.yml.
- space-efficient: uv uses hard links to share Python packages between projects and its cache. The same package in different projects occupies only one package’s worth of space in the filesystem, and because no copying happens, installing the same package a second time is nearly instant!
To use it on a distributed cluster, set these environment variables:
```bash
export UV_CACHE_DIR=$shared_home/cache/uv
```
Then set up Python environments on the login node and use them on the compute nodes.
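The workflow then looks roughly like this (a sketch; the project path, package and script name are illustrative):

```bash
# On the login node: create the project environment on shared storage
cd /gpfs/your/project
uv init          # or reuse an existing pyproject.toml
uv add vllm      # resolves online, hard-links packages from the shared cache

# On a compute node: the same .venv and cache are already visible
cd /gpfs/your/project
uv run python your_script.py
```

If uv also manages Python interpreters for you, consider relocating those onto shared storage as well (see UV_PYTHON_INSTALL_DIR in the uv documentation).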
If you are not impressed by uv and still want to use conda, check out mamba for speed and compatibility.
Weights and Biases
On an air-gapped machine, wandb complains that it cannot connect to the server unless you manually enable offline mode. By default it then stores its data in ./wandb and ./artifacts relative to your training script, and in ~/.cache/wandb. The last folder causes trouble when you upload the logs from the login node with wandb sync /path/to/log/dir: wandb stages artifacts in ~/.cache/wandb, which is not on shared storage, so a file-missing error can occur whenever your code creates a wandb artifact (verl does, for example). To change these paths, set
```bash
export WANDB_MODE=offline # This improves portability because you don't set offline mode in your code
```
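To actually relocate the staging folder mentioned above, wandb also reads cache-related environment variables. A sketch, assuming the documented WANDB_CACHE_DIR and WANDB_ARTIFACT_DIR variables (the paths are examples):

```bash
# Move the ~/.cache/wandb staging area onto shared storage
export WANDB_CACHE_DIR=$shared_home/cache/wandb_cache
# Optionally redirect the ./artifacts directory as well
export WANDB_ARTIFACT_DIR=$shared_home/cache/wandb_artifacts
```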
It’s also beneficial to configure wandb to store running logs in a centralized location. This enables you to write scripts or even fancy frontends (code is cheap nowadays!) to synchronize logs from the login node periodically (see the appendix for a sketch of such a sync script).
```bash
export WANDB_DIR=$shared_home/cache # wandb will create a folder called `wandb` inside
```
Why it works
What makes the magic happen is caching. On a local machine, temporary items like downloaded models are conventionally (by the XDG standard) stored in a directory under ~/.cache, and programs check that cache to avoid re-downloading. Pointing this path at the shared storage essentially carries the convenience of caches across nodes. Sometimes you also need variables like HF_HUB_OFFLINE and WANDB_MODE to explicitly disallow internet connections.
It may seem tempting to share ~/.cache, ~/.local, ~/.config or even the entire home directory by sym-linking them to the shared storage. This is discouraged, because shared filesystems like GPFS are much slower than a server’s local storage, especially for I/O on many small files. It can slow down random tools that treat .cache as fast temporary storage, even the rendering of every shell prompt! Also, not all software supports symbolic links, or concurrent reads and writes to its cache files from different machines.
For the same reason, on a cluster, environment variables are preferred over config files in ~/.config, because this saves a file read every time you run the tool.
Part 3: Pro Tips
Tips on ML Environment Setup
While it’s generally encouraged to set up Python environments online on the login node, some packages, such as apex, flash_attn and megatron, require compilation to build their wheels. Compilation does not require a GPU, but it needs a CUDA toolkit and ample CPU and memory - something a login node may not offer. In this case, you can either download and install a pre-built wheel if one exists (this reduces reproducibility and portability), or build the wheel from source on a compute node.
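For example, a wheel can be built once on a compute node and kept on shared storage for later installs (a sketch; the package name and paths are illustrative):

```bash
# On a compute node with the CUDA toolkit available:
pip wheel flash-attn --no-build-isolation -w /gpfs/wheels
# Later, install the pre-built wheel into any environment:
uv pip install /gpfs/wheels/flash_attn-*.whl
```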
Using Internal Mirrors
Suddenly want to install a package on an air-gapped node? No problem! Clusters often provide a Nexus-based internal mirror for apt, PyPI and Docker. If there is one, you can set it up in your container image.
In particular, the PyPI mirror should be set via the UV_INDEX environment variable, because uv pip does not respect pip configuration files. You should also be aware that using the internal mirror to set up Python environments reduces reproducibility, because the mirror is not reachable outside the cluster. Also, a PyPI mirror does not cover third-party packages hosted outside PyPI, like flashinfer (as used to be the case).
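As a sketch, wiring up such a mirror in the image might look like this (the hostnames and repository paths are placeholders for whatever your cluster actually provides):

```bash
# Hypothetical internal Nexus endpoints; replace with your cluster's URLs
export UV_INDEX=http://nexus.internal/repository/pypi/simple
export PIP_INDEX_URL=http://nexus.internal/repository/pypi/simple  # for plain pip
# apt: point sources.list at the internal mirror
sed -i 's|http://archive.ubuntu.com|http://nexus.internal/repository|g' /etc/apt/sources.list
```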
Fail Internet Connections Immediately
Internet isolation is typically achieved through routing rules, so no packets can travel between the machine and the Internet. As a result, an attempt to connect never receives an explicit refusal; it simply gets no response at all. This sometimes causes scripts to hang, timing out only after minutes of waiting.
To mitigate this, my solution is to run a custom DNS server and point the system resolver at it. The DNS server returns NXDOMAIN for every non-intranet domain, effectively signaling to every web client that the internet is not accessible the moment it tries to connect (a sketch of such a server is included in the appendix).
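Pointing the container’s resolver at such a server is then a one-liner (assuming the DNS server listens on 127.0.0.1):

```bash
echo "nameserver 127.0.0.1" > /etc/resolv.conf
```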
Miscellaneous
Powerlevel10k fetches gitstatusd into ~/.cache/gitstatusd to show git status in your prompt. It can’t do so in an air-gapped environment, so make sure to include this file, and make it executable, when building your container image.
Autojump stores its database at ~/.local/share/autojump/autojump.txt. I sym-linked it to the shared storage and haven’t encountered issues yet.
Appendix
My Setup
Part of my .zshrc. Adapt it to your needs.
```zsh
# User configuration
```
The setup script that I use to adapt a new container image does the following, in outline (the package list, helper names and paths below are examples, not the verbatim script):
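```bash
#!/bin/bash
# Outline of the image-adaptation steps from Part 1; package list, helper
# script names and paths are examples rather than the exact script.
set -e

yes | unminimize                      # restore tools stripped from the minimal image
apt-get update
apt-get install -y git vim tmux htop curl wget zsh

/gpfs/config/link_configs.sh          # symlink dotfiles from shared storage (script 1 below)

# Bundle a pre-fetched gitstatusd binary for Powerlevel10k
# (the exact file name inside the folder is platform-dependent)
mkdir -p ~/.cache/gitstatusd
cp /gpfs/config/gitstatusd/* ~/.cache/gitstatusd/
chmod +x ~/.cache/gitstatusd/*
```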
AI-generated scripts that I mentioned
Warning: Untested. Not guaranteed to work!
These scripts complement the setup described above and provide practical utilities for managing an air-gapped HPC environment. Make them executable with chmod +x script_name.sh and place them in your $PATH or /gpfs/config/bin/ for easy access.
1. Configuration File Linking Script
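A possible implementation: link every regular dotfile kept under /gpfs/config into the container’s home directory (adjust the directory and exclusions to your layout):

```bash
#!/bin/bash
# Link every dotfile under /gpfs/config into $HOME, as described in Part 1.
CONFIG_DIR=/gpfs/config

for f in "$CONFIG_DIR"/.*; do
    name=$(basename "$f")
    case "$name" in
        .|..) continue ;;          # skip the directory entries themselves
    esac
    [ -f "$f" ] || continue        # only link regular files
    ln -sf "$f" "$HOME/$name"      # -f overwrites any stale copy baked into the image
    echo "linked $HOME/$name -> $f"
done
```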
2. Simple DNS Server for Air-gapped Nodes
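A possible implementation using the third-party dnslib package (the intranet suffixes and the upstream DNS address are placeholders):

```python
#!/usr/bin/env python3
"""Answer NXDOMAIN for every non-intranet domain so that clients fail fast.

Untested sketch: run it on the compute node and point /etc/resolv.conf at
127.0.0.1. Requires the `dnslib` package.
"""
from dnslib import DNSRecord, RCODE
from dnslib.server import BaseResolver, DNSServer

INTRANET_SUFFIXES = (".cluster.local", ".internal")  # hypothetical intranet domains
UPSTREAM_DNS = "10.0.0.2"                            # hypothetical internal DNS server


class NxDomainResolver(BaseResolver):
    def resolve(self, request, handler):
        qname = str(request.q.qname).rstrip(".")
        if qname.endswith(INTRANET_SUFFIXES):
            # Forward intranet queries to the real internal DNS server
            raw = request.send(UPSTREAM_DNS, 53, timeout=2)
            return DNSRecord.parse(raw)
        # Everything else: reply NXDOMAIN immediately instead of letting clients hang
        reply = request.reply()
        reply.header.rcode = RCODE.NXDOMAIN
        return reply


if __name__ == "__main__":
    DNSServer(NxDomainResolver(), address="127.0.0.1", port=53).start()
```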
3. WandB Sync Script
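A possible implementation: on the Internet-connected login node, loop over offline runs under the shared WANDB_DIR from Part 2 and push them with wandb sync (the path and interval are examples):

```bash
#!/bin/bash
# Periodically upload offline wandb runs from shared storage on the login node.
# The directory below must match WANDB_DIR from Part 2 (wandb creates a `wandb` subfolder).
WANDB_LOG_DIR=/gpfs/cache/wandb

while true; do
    for run in "$WANDB_LOG_DIR"/offline-run-*; do
        [ -d "$run" ] || continue
        wandb sync "$run"          # upload (or resume uploading) this run
    done
    sleep 300                      # look for new runs every 5 minutes
done
```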
- Title: Taming Your Offline HPC Environment
- Author: Stargazer ZJ
- Created at: 2025-08-12 15:34:19
- Updated at: 2025-08-16 15:26:24
- Link: https://ji-z.net/2025/08/12/Taming-Your-Offline-HPC-Environment/
- License: This work is licensed under CC BY-NC-SA 4.0.