Taming Your Offline HPC Environment

Stargazer ZJ

My compute cluster at work, like most clusters, has air-gapped compute nodes (no internet access) and internet-connected login nodes. A shared GPFS storage volume is mounted on every node. Moreover, the cluster is managed by Kubernetes (k8s), a distributed container orchestrator. Containers are considered disposable: every file is lost when a container stops, except for files on the shared storage.

This setup, while understandable for cybersecurity and scalability reasons, adds inconvenience to the usual ML development workflow you would have on a single workstation. In this article I will introduce the tips and tricks I use to create a streamlined, “local-like” experience.

Part 1: Portable and Persistent Workspace

Containers are designed to be minimal. Like the container housing you see on a construction site, a Docker container ships only the programs necessary to keep the system running, with minimal resource consumption. Containers are also disposable, so changes made inside a container are lost permanently when it exits.

For day-to-day use, you need some furnishing to make your container feel like home: time-tested utility tools and tweaks to your shell configuration files. Thus, you need to build your own container image (the template from which new containers are created).

The standard way to build a container image is with a Dockerfile. Unfortunately, my cluster management platform does not support adding images via Dockerfile. Instead, I have to start from an existing container, make changes interactively, and save the resulting state as a new image.

Therefore, I won’t provide a ready-to-use Dockerfile here; I will only document the key points of the process.

Install necessary tools

For Ubuntu-based containers, running unminimize brings back most of the tools and documentation removed from the minimized Docker base image. Afterwards you can install your favourite software packages. Mine are:

apt-get update
apt-get install -y vim iproute2 tmux autojump aria2 curl wget htop btop zsh git nvtop vmtouch clang gcc g++ cmake 7zip file jq tree iputils-ping
curl -LsSf https://astral.sh/uv/install.sh | sh
uv tool install ninja
uv tool install "huggingface_hub[cli]"
uv tool install wandb
uv tool install nvitop
uv python pin --global 3.12

My cluster mounts the shared storage at /gpfs, and the container home directory /root is not shared. Some other clusters do share the home directory, via the HOME variable and proper ownership mapping; if that is your case, this and quite a few other sections of this post can be skipped.

Configuration files such as .bashrc are tweaked every time you add a new alias, environment variable or shell option. So instead of baking them into the container image, it’s better to let them reside on the shared storage. This avoids rebuilding the image frequently.

But how does the shell find these configuration files? Symbolic links solve this problem. A symbolic link is a shortcut pointing to the actual file location, but it behaves like the real file when read or written. To create one, run

ln -s /gpfs/config/.bashrc ~/.bashrc

If there is more than one configuration file for different tools, you can write a script that links everything under /gpfs/config; a minimal sketch follows (a fuller, AI-generated version is in the appendix).
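
Assuming your dotfiles live directly under /gpfs/config, the loop could look roughly like this (the file list is illustrative; ln -sf replaces any existing link):

for f in .bashrc .zshrc .vimrc .tmux.conf; do
    ln -sf /gpfs/config/"$f" ~/"$f"
done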

And that’s it: log into any new container and it instantly feels like your own terminal.

Part 2: Magic of Caches: Local-like Experience Across Nodes

On a distributed cluster, the shared storage system is often the only place to pass artifacts between nodes. This adds a lot of inconvenience to your workflow.

Take Hugging Face models as an example. On a local workstation you can run vllm serve some/model and the model is downloaded automatically on demand. On a cluster, however, the overwhelmingly common practice is to first run hf download some/model --local-dir /gpfs/your/project/some_model on the login node, then switch to the compute node and run vllm serve /gpfs/your/project/some_model, referring to the model by its location in the filesystem.

Not only are those long file paths inconvenient, they also hurt the portability of your codebase. Hardcoded file paths in scripts and code mean your code cannot run on another cluster without modification. Moreover, placing large assets like models inside the project directory discourages sharing between projects and cleaning up unused assets, wasting storage space.

Considering convenience, portability and reusability, I’ll introduce three better practices.

Huggingface

In your .bashrc set these environment variables:

shared_home=/gpfs           # path to your shared storage
export HF_HOME=$shared_home/cache/hf
export HF_HUB_OFFLINE=1 # only set this on air-gapped machines

Then, to use a model, simply run hf download some/model on the login node without specifying where to store it. On the compute node, refer to the model by its name, just as you would on a local machine.

This trick works for vllm, verl, and virtually everything in the Hugging Face transformers ecosystem. In the rare case where a folder path is expected, run hf scan-cache to see where the model is stored.
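
For example, with a model name that is only an illustration:

# on the login node (internet access)
hf download Qwen/Qwen2.5-7B-Instruct

# on the air-gapped compute node (HF_HUB_OFFLINE=1)
vllm serve Qwen/Qwen2.5-7B-Instruct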

uv

uv is a modern Python package and project manager. I like it, and you too should prefer uv over conda, because uv is:

  • faster: uv downloads concurrently and resolves dependencies faster because it’s written in Rust.
  • more portable: uv uses the standardized pyproject.toml (PEP 621) and lock files for reproducible dependencies. Say goodbye to the notorious requirements.txt and the proprietary environment.yml.
  • space-efficient: uv uses hard links to share Python packages between projects and its cache. The same package in different projects occupies only one package’s worth of space on the filesystem. Because no copying happens, installing the same package a second time is instant!

To use it on a distributed cluster, set these environment variables:

export UV_CACHE_DIR=$shared_home/cache/uv
export UV_PYTHON_INSTALL_DIR=$shared_home/cache/uv_python
export UV_PYTHON_PREFERENCE=only-managed
export UV_VENV_SEED=1
export UV_TORCH_BACKEND=cu128 # optional

Then set up Python environments on the login node and use them on the compute nodes, as sketched below.
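
A minimal sketch of that workflow; the project path, package, and entry point are all illustrative:

# on the login node
cd /gpfs/your/project        # hypothetical project path on shared storage
uv init
uv add numpy                 # package name is just an example

# on the compute node
cd /gpfs/your/project
uv run python train.py       # hypothetical entry point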

If you are not impressed by uv and still want to use conda, check out mamba for speed and compatibility.

Weights and Biases

On an air-gapped machine, wandb complains that it cannot connect to the server unless you manually enable offline mode. In offline mode it stores run data, by default, under ./wandb and ./artifacts relative to your training script, and staging artifacts under ~/.cache/wandb. That last location causes trouble when you later upload the logs from the login node with wandb sync /path/to/log/dir: since staging artifacts sit in the non-shared ~/.cache/wandb, a missing-file error can occur whenever your code creates a wandb artifact (verl does, for example). To change these paths, set

export WANDB_MODE=offline          # This improves portability because you don't set offline mode in your code
export WANDB_DATA_DIR=$shared_home/cache/wandb_data

It’s also beneficial to configure wandb to store run logs in a centralized location. This lets you write scripts, or even fancy frontends (code is cheap nowadays!), to synchronize logs from the login node periodically.

export WANDB_DIR=$shared_home/cache # wandb will create a folder called `wandb` inside
export WANDB_ARTIFACT_DIR=$shared_home/cache/wandb_artifacts
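
A minimal periodic-sync loop for the login node, assuming the WANDB_DIR above (the AI-generated script in the appendix is a more elaborate version of the same idea):

while true; do
    for run in /gpfs/cache/wandb/offline-run-*; do
        [ -d "$run" ] && wandb sync "$run"
    done
    sleep 600
done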

Why it works

What lets the magic happen is caching. On a local machine, temporary items like downloaded models are conventionally (per the XDG standard) stored under ~/.cache. Programs consult the cache to avoid re-downloading. Pointing this path at the shared storage essentially carries the convenience of caches across nodes. Sometimes you also need variables like HF_HUB_OFFLINE and WANDB_MODE to explicitly disallow internet connections.

It may seem tempting to share ~/.cache, ~/.local, ~/.config or even the entire home directory by symlinking them to the shared storage. This is discouraged because shared filesystems like GPFS are much slower than a server’s local storage, especially for I/O on many small files. This can slow down random tools that treat ~/.cache as fast temporary storage, even the rendering of every shell prompt! Also, not all software tolerates symbolic links, or concurrent reads and writes to its cache files from different machines.

For the same reason, on a cluster, environment variables are preferred over config files in ~/.config, because they save a file read every time you run the tool.

Part 3: Pro Tips

Tips on ML Environment Setup

While it’s generally encouraged to set up Python environments online on the login node, some packages, such as apex, flash_attn and megatron, require compilation to build their wheels. Compilation does not require a GPU, but it does need a CUDA toolkit and ample CPU and memory, something a login node may not offer. In this case, you can either download and install a pre-built wheel if one exists (this reduces reproducibility and portability), or build the wheel from source on a compute node.
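
A rough sketch of the build-on-compute-node route, assuming the CUDA toolkit is present in the container and /gpfs/wheels is a hypothetical drop-off directory (flash-attn needs torch, ninja and packaging installed first and honors MAX_JOBS to limit parallel compilation):

# on a compute node, inside the project's virtualenv
uv pip install torch ninja packaging
MAX_JOBS=8 pip wheel flash-attn --no-build-isolation --wheel-dir /gpfs/wheels/

# later, on any node
uv pip install /gpfs/wheels/flash_attn-*.whl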

Using Internal Mirrors

Suddenly want to install a package on an air-gapped node? No problem! Clusters often provide a Nexus-based internal mirror for apt, PyPI and Docker. If there is one, you can set it up in your container image.

In particular, the PyPI mirror should be set via the UV_INDEX environment variable, because uv pip does not respect pip configuration files. Be aware that using the internal mirror to set up Python environments reduces reproducibility, because the mirror is not available outside the cluster. Also, a PyPI mirror does not cover third-party packages hosted outside PyPI, such as flashinfer (which used to be one).
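
A minimal sketch, with a hypothetical mirror URL:

# hypothetical Nexus mirror; adjust to your cluster
export UV_INDEX=http://nexus.your.mirror/pypi/simple
# for tools that shell out to pip directly
export PIP_INDEX_URL=http://nexus.your.mirror/pypi/simple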

Fail Internet Connections Immediately

Internet isolation is typically implemented with routing rules, so no packets can travel between the machine and the internet at all. As a result, any attempt to connect to the internet never receives a negative response either. This sometimes causes scripts to hang, timing out only after minutes of waiting.

To mitigate this, my solution is to run a custom DNS server and point the system resolver at it. The DNS server returns NXDOMAIN for every non-intranet domain, signaling to every web client that the internet is unreachable the moment it tries to connect.
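
For instance, using the AI-generated air_gap_dns.py from the appendix (how /etc/resolv.conf is managed may differ on your cluster, so treat this as a sketch):

# start the blocking DNS server (needs root to bind port 53)
python3 air_gap_dns.py &

# point the system resolver at it
echo "nameserver 127.0.0.1" > /etc/resolv.conf

# external lookups now fail immediately instead of hanging
getent hosts pypi.org || echo "blocked as expected"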

Miscellaneous

Powerlevel10k fetches gitstatusd into ~/.cache/gitstatus to show git status in your prompt. It can’t do so in an air-gapped environment, so make sure to include this binary, and make it executable, when building your container image.
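
Roughly, during the image build (assuming the binary was downloaded beforehand on an internet-connected machine and stored under /gpfs/config, as in my setup script below):

mkdir -p ~/.cache/gitstatus
cp /gpfs/config/gitstatusd-linux-x86_64 ~/.cache/gitstatus/
chmod +x ~/.cache/gitstatus/gitstatusd-linux-x86_64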

Autojump stores its database at ~/.local/share/autojump/autojump.txt. I sym-linked it to the shared storage and haven’t yet encountered issues.

Appendix

My Setup

Part of my .zshrc. Adapt it to your needs.

# User configuration
# setopt no_share_history
HISTSIZE=200000
SAVEHIST=200000
LS_COLORS="ow=01;90:di=01;90"
PATH=$HOME/.local/bin:$PATH
export UV_INDEX=http://nexus.your.mirror/pypi/simple
shared_home=/gpfs
export HF_HOME=$shared_home/cache/hf
export UV_CACHE_DIR=$shared_home/cache/uv
export UV_PYTHON_INSTALL_DIR=$shared_home/cache/uv_python
export UV_PYTHON_PREFERENCE=only-managed
export UV_VENV_SEED=1
export UV_TORCH_BACKEND=cu128
export WANDB_MODE=offline
export WANDB_DIR=$shared_home/cache
export WANDB_ARTIFACT_DIR=$shared_home/cache/wandb_artifacts
export WANDB_DATA_DIR=$shared_home/cache/wandb_data
export HF_HUB_OFFLINE=1
# export RUSTUP_HOME=$shared_home/cache/rust/rustup
# export CARGO_HOME=$shared_home/cache/rust/cargo
unset shared_home

alias hist='fc -lDd'
alias cp="cp -i" # Confirm before overwriting something
alias cdr='cd $([ $# -eq 0 ] && pwd -P || realpath "$1")'
mdcd() { mkdir $1 && cd $1 }
bindkey '^H' backward-kill-word

function act() {
    local base_path="${1:-.}" # Use current directory if no argument is provided
    # Remove trailing slash if present
    base_path="${base_path%/}"
    if [ ! -d "$base_path" ]; then
        echo "Error: Directory $base_path does not exist"
        return 1
    fi
    if [ -f "$base_path/.venv/bin/activate" ]; then
        source "$base_path/.venv/bin/activate"
    elif [ -f "$base_path/bin/activate" ]; then
        source "$base_path/bin/activate"
    else
        echo "Error: Could not find activation script in $base_path"
        return 1
    fi
}

function waitvllm() {
    local url=http://localhost:8000
    local suffix=/health
    local testurl=$url$suffix
    if ! curl -s $testurl >/dev/null; then
        echo "VLLM is not started, waiting for $testurl ..."
        sleep 1
    fi
    while ! curl -s $testurl >/dev/null; do
        sleep 1;
    done
}

The setup script that I use to adapt a new container image:

#!/usr/bin/env bash

echo "⚠️ WARNING ⚠️"
echo "This script is designed to be run ONCE on a new cloud instance."
echo "Running it multiple times may cause issues with your package sources."
echo "This script will:"
echo " - Modify your apt sources to use a custom mirror"
echo " - Install Python pip and uv package manager"
echo " - Install various system utilities"
echo ""

# Confirmation prompt
read -p "Do you want to proceed with the setup? (y/n): " confirm
if [[ "$confirm" != "y" && "$confirm" != "Y" ]]; then
echo "Setup cancelled."
exit 1
fi

# Check if backup already exists (indication script was run before)
if [ -f /etc/apt/sources.list.bak ]; then
    echo "Warning: /etc/apt/sources.list.bak already exists."
    echo "This suggests the script may have been run before."
    read -p "Continue anyway? (y/n): " override
    if [[ "$override" != "y" && "$override" != "Y" ]]; then
        echo "Setup cancelled."
        exit 1
    fi
fi

# Proceed with the setup
echo "Starting setup..."

cp /etc/apt/sources.list /etc/apt/sources.list.bak
echo "Original sources list backed up to /etc/apt/sources.list.bak"

sed -i 's|http://[^ ]*.ubuntu.com/ubuntu/|http://nexus.your.mirror/ubuntu/|g' /etc/apt/sources.list
echo "APT sources updated to use custom mirror"

echo "Updating package lists..."
apt update

echo "Unminimizing"
unminimize -y

echo "Installing python3-pip..."
apt install -y python3-pip

echo "Installing uv package manager..."
pip install uv -i http://nexus.your.mirror/pypi/simple

echo "Installing system utilities..."
apt install -y vim iproute2 tmux autojump aria2 curl wget htop btop zsh git nvtop vmtouch clang gcc g++ cmake 7zip file jq tree iputils-ping
uv tool install ninja
uv tool install "huggingface_hub[cli]"
uv tool install wandb
uv tool install nvitop
uv python pin --global 3.12

echo "Installing gitstatusd for p10k..."
mkdir -p ~/.cache/gitstatus
cp /gpfs/config/gitstatusd-linux-x86_64 ~/.cache/gitstatus/

echo "Linking autojump database"
mkdir -p ~/.local/share/autojump
ln -s /gpfs/config/autojump/autojump.txt ~/.local/share/autojump/autojump.txt

echo "✅ Setup completed successfully!"
AI-generated scripts that I mentioned

Warning: Untested. Not guaranteed to work!

Based on your blog post, here are some useful scripts and tools you mentioned.

These scripts complement the setup described in your blog post and provide practical utilities for managing an air-gapped HPC environment. Make them executable with chmod +x script_name.sh and place them in your $PATH or /gpfs/config/bin/ for easy access.

1. Configuration File Linking Script

#!/usr/bin/env bash
# link_configs.sh - Link all configuration files from shared storage

CONFIG_SOURCE="/gpfs/config"
CONFIG_TARGET="$HOME"

if [ ! -d "$CONFIG_SOURCE" ]; then
echo "Error: Config source directory $CONFIG_SOURCE does not exist"
exit 1
fi

echo "Linking configuration files from $CONFIG_SOURCE to $CONFIG_TARGET"

# Common config files to link
configs=(
    ".bashrc"
    ".zshrc"
    ".vimrc"
    ".tmux.conf"
    ".gitconfig"
    ".p10k.zsh"
)

for config in "${configs[@]}"; do
    source_file="$CONFIG_SOURCE/$config"
    target_file="$CONFIG_TARGET/$config"

    if [ -f "$source_file" ]; then
        # Backup existing file if it exists and isn't a symlink
        if [ -f "$target_file" ] && [ ! -L "$target_file" ]; then
            mv "$target_file" "$target_file.backup.$(date +%s)"
            echo "Backed up existing $config"
        fi

        # Remove existing symlink
        [ -L "$target_file" ] && rm "$target_file"

        # Create new symlink
        ln -s "$source_file" "$target_file"
        echo "✓ Linked $config"
    else
        echo "- Skipped $config (not found in source)"
    fi
done

echo "Configuration linking completed!"

2. Simple DNS Server for Air-gapped Nodes

#!/usr/bin/env python3
# air_gap_dns.py - Simple DNS server that blocks external domains

import socket
import struct
from threading import Thread

class AirGapDNS:
    def __init__(self, port=53, allowed_domains=None):
        self.port = port
        self.allowed_domains = allowed_domains or [
            'localhost',
            '.local',
            '.internal',
            'your.cluster.domain'  # Replace with your internal domain
        ]

    def is_allowed_domain(self, domain):
        domain = domain.lower()
        return any(domain.endswith(allowed) or domain == allowed.lstrip('.')
                   for allowed in self.allowed_domains)

    def create_response(self, query_data, allowed=False):
        # Parse query ID
        query_id = struct.unpack('!H', query_data[:2])[0]

        if allowed:
            # Return NOERROR but no records (let system resolver handle it)
            flags = 0x8100  # Response, no error
        else:
            # Return NXDOMAIN
            flags = 0x8183  # Response, NXDOMAIN

        # Build response header
        response = struct.pack('!HHHHHH', query_id, flags, 1, 0, 0, 0)

        # Echo back the question section
        question_start = 12
        question_end = query_data.find(b'\x00', question_start) + 5
        response += query_data[question_start:question_end]

        return response

    def parse_domain(self, query_data):
        # Skip header, parse domain name
        i = 12
        domain_parts = []
        while i < len(query_data):
            length = query_data[i]
            if length == 0:
                break
            i += 1
            domain_parts.append(query_data[i:i+length].decode('utf-8'))
            i += length
        return '.'.join(domain_parts)

    def handle_query(self, data, addr, sock):
        try:
            domain = self.parse_domain(data)
            allowed = self.is_allowed_domain(domain)

            if not allowed:
                response = self.create_response(data, allowed=False)
                sock.sendto(response, addr)
                print(f"Blocked: {domain}")
            else:
                print(f"Allowed: {domain}")
                # Don't respond, let it timeout or use real DNS

        except Exception as e:
            print(f"Error handling query: {e}")

    def start(self):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(('127.0.0.1', self.port))
        print(f"Air-gap DNS server running on port {self.port}")
        print("Add 'nameserver 127.0.0.1' to /etc/resolv.conf to use")

        while True:
            try:
                data, addr = sock.recvfrom(512)
                Thread(target=self.handle_query, args=(data, addr, sock)).start()
            except KeyboardInterrupt:
                break

        sock.close()

if __name__ == "__main__":
    dns_server = AirGapDNS()
    dns_server.start()

3. WandB Sync Script

#!/usr/bin/env bash
# wandb_sync.sh - Sync offline wandb runs to the cloud

WANDB_DIR="${WANDB_DIR:-/gpfs/cache/wandb}"
LOG_FILE="/gpfs/logs/wandb_sync.log"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

sync_runs() {
    local run_dir="$1"
    local project_name=$(basename "$run_dir")

    log "Syncing project: $project_name"

    for run_path in "$run_dir"/offline-run-*; do
        [ ! -d "$run_path" ] && continue

        local run_id=$(basename "$run_path")
        log " Syncing run: $run_id"

        if wandb sync "$run_path" 2>&1 | tee -a "$LOG_FILE"; then
            log " ✓ Successfully synced $run_id"
            # Optionally move to synced directory
            # mkdir -p "$run_dir/synced"
            # mv "$run_path" "$run_dir/synced/"
        else
            log " ✗ Failed to sync $run_id"
        fi
    done
}

main() {
    if [ ! -d "$WANDB_DIR" ]; then
        log "Error: WANDB_DIR $WANDB_DIR not found"
        exit 1
    fi

    log "Starting wandb sync process"

    # Find all project directories
    for project_dir in "$WANDB_DIR"/*; do
        [ ! -d "$project_dir" ] && continue
        [ "$(basename "$project_dir")" = "artifacts" ] && continue

        if ls "$project_dir"/offline-run-* >/dev/null 2>&1; then
            sync_runs "$project_dir"
        fi
    done

    log "Wandb sync completed"
}

# Run with optional project filter
if [ $# -eq 1 ]; then
    PROJECT_FILTER="$1"
    if [ -d "$WANDB_DIR/$PROJECT_FILTER" ]; then
        sync_runs "$WANDB_DIR/$PROJECT_FILTER"
    else
        log "Project $PROJECT_FILTER not found"
        exit 1
    fi
else
    main
fi

  • Title: Taming Your Offline HPC Environment
  • Author: Stargazer ZJ
  • Created at : 2025-08-12 15:34:19
  • Updated at : 2025-08-16 15:26:24
  • Link: https://ji-z.net/2025/08/12/Taming-Your-Offline-HPC-Environment/
  • License: This work is licensed under CC BY-NC-SA 4.0.