Create Your Own Virtual HPC Cluster
This document outlines the process for setting up a virtualized network for a High Performance Computing cluster. The objective is to establish a virtual network that supports SSH-based management, NFS shared storage, inter-node communication, and Munge-based authentication for Slurm, with optional support for Salt configuration management. This guide is intended for interns and students learning about systems administration and HPC infrastructure.
Network Architecture
The HPC cluster is composed of the following VM roles:
| VM Name | Role | Services |
|---|---|---|
| admin | SSH, NFS server | NFS, Munge, Slurm client, SSH |
| controller | Slurm controller, Munge master | slurmctld, munge |
| node01 | Compute node | slurmd, munge |
Step 1: Create a Virtual Network
Learn more about using libvirt to create a virtual network by visiting the libvirt wiki
Define Network XML
Create a file and name it something like cluster-net.xml
Your XML file may look something like:
<network>
  <name>hpc-cluster-net</name>
  <forward mode='nat'/>
  <bridge name='virbr10' stp='on' delay='0'/>
  <ip address='192.168.100.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.100.100' end='192.168.100.200'/>
    </dhcp>
  </ip>
</network>
Create and Start Network
Use basic command line tools to define and start your cluster.
sudo virsh net-define cluster-net.xml
sudo virsh net-autostart hpc-cluster-net
sudo virsh net-start hpc-cluster-net
Use virsh net-list --all to verify the network is created and started.
Step 2: Create the VMs
Use virt-manager or virt-install to configure and create VMs attached to the network.
- virt-manager is the graphical front end for libvirt and provides general virtual machine management.
Install virt-install on a RHEL-family host by running:
sudo yum install virt-install
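If your host runs a Debian or Ubuntu-based distribution instead, the package that provides virt-install is typically named virtinst:
sudo apt install virtinst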
Download ISO
For this tutorial, download the Ubuntu Server ISO, and be sure to move that file to a location that is readable by the qemu user.
sudo mkdir -p /var/lib/libvirt/boot
sudo cp ~/Downloads/ubuntu-22.04.4-live-server-amd64.iso /var/lib/libvirt/boot/
Create your first VM
Use virt-install if you prefer to configure the VM from the command line. For example:
sudo virt-install \
  --connect qemu:///system \
  --name admin \
  --memory 2048 \
  --vcpus 2 \
  --disk size=10 \
  --cdrom /var/lib/libvirt/boot/ubuntu-22.04.4-live-server-amd64.iso \
  --network network=hpc-cluster-net \
  --os-variant=ubuntu22.04 \
  --noautoconsole
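Because --noautoconsole is used, the Ubuntu installer will not open automatically. Assuming virt-viewer (or virt-manager) is installed on the host, you can attach to the VM's graphical console to finish the installation, for example:
virt-viewer --connect qemu:///system admin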
Clone the Admin node
To create the other two VMs, clone the admin node you just created. First, shut down the admin VM temporarily.
sudo virsh shutdown admin
Verify the VM is shut off by running:
sudo virsh list --all
Clone it to controller and node01
sudo virt-clone --original admin --name controller --file /var/lib/libvirt/images/controller.qcow2
sudo virt-clone --original admin --name node01 --file /var/lib/libvirt/images/node01.qcow2
And start the new VMs
sudo virsh start controller
sudo virsh start node01
Reconfigure VM Hostname
For each new VM, access the console and change the hostname:
sudo hostnamectl set-hostname controller
sudo hostnamectl set-hostname node01
And reboot for hostname to take effect:
sudo reboot
Reset IP on controller and node01
Because the controller and node01 were cloned from the admin node, all three VMs share the same machine ID and may be handed the same IP address. To generate new machine IDs (and therefore new DHCP leases) on the controller and node01, log in to each of them and run:
sudo truncate -s 0 /etc/machine-id
sudo systemd-machine-id-setup
You may also need to delete old DHCP leases (if they exist)
sudo rm -f /var/lib/dhcp/*
sudo rm -f /var/lib/NetworkManager/*lease*
Reboot each VM so it requests a new DHCP lease:
sudo reboot
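To confirm that each VM now holds its own lease, list the leases for the cluster network from the host:
sudo virsh net-dhcp-leases hpc-cluster-net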
Step 3: Set Static IPs and Hostnames
Edit /etc/hosts on all nodes so that each VM's IP address and hostname are listed. Substitute the addresses your VMs actually use (note that 192.168.100.1 is also the bridge/gateway address defined in the network XML, so confirm which address your admin VM really has):
192.168.100.1 admin
192.168.100.10 controller
192.168.100.101 node01
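If you want these addresses to be truly static rather than handed out by DHCP, one option on Ubuntu Server guests is a netplan file. The sketch below is for the controller and assumes the interface is named enp1s0 (check with ip link) and a hypothetical file name /etc/netplan/01-static.yaml; apply it with sudo netplan apply.
# /etc/netplan/01-static.yaml (hypothetical file name)
network:
  version: 2
  ethernets:
    enp1s0:                      # assumed interface name; verify with `ip link`
      dhcp4: false
      addresses: [192.168.100.10/24]
      routes:
        - to: default
          via: 192.168.100.1     # the gateway defined in the network XML
      nameservers:
        addresses: [192.168.100.1]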
Step 4: Setup SSH Key Access
Generate SSH Key on Your Laptop
ssh-keygen -t rsa -b 4096
Copy SSH Key to Nodes
ssh-copy-id user_admin@192.168.100.1 # admin
ssh-copy-id user_admin@192.168.100.10 # controller
ssh-copy-id user_admin@192.168.100.101 # node01
Optional: Setup ~/.ssh/config on Your Laptop
This allows you to SSH into your nodes from your laptop without opening the VM console directly.
Host admin
    HostName 192.168.100.1
    User user_admin

Host controller
    HostName 192.168.100.10
    User user_admin

Host node01
    HostName 192.168.100.101
    User user_admin
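With this configuration in place you can connect using just the host alias, for example:
ssh controller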
Step 5: Configure NFS Server (on Admin)
Install and Export Directories
sudo apt install nfs-kernel-server
sudo mkdir -p /srv/nfs/home
Add to /etc/exports:
/srv/nfs/home 192.168.100.0/24(rw,sync,no_subtree_check,no_root_squash)
Apply exports:
sudo exportfs -a
sudo systemctl restart nfs-server
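The controller and node01 still need to mount the export. The guide does not prescribe a mount point, so the sketch below assumes you want the share at the same path, /srv/nfs/home, on each client:
sudo apt install nfs-common
sudo mkdir -p /srv/nfs/home
sudo mount admin:/srv/nfs/home /srv/nfs/home
# To make the mount persistent, add a line like this to /etc/fstab:
# admin:/srv/nfs/home  /srv/nfs/home  nfs  defaults  0  0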
Step 6: Setup Munge Authentication
Munge must be installed on all nodes to allow authenticated communication between Slurm components.
Install Munge on Nodes
sudo apt update
sudo apt install munge -y
Generate a Munge Key (on admin)
sudo dd if=/dev/urandom bs=1 count=1024 of=/etc/munge/munge.key
sudo cp /etc/munge/munge.key /tmp/munge.key
sudo chmod 644 /tmp/munge.key
Copy the Munge Key to the Other Nodes
scp /tmp/munge.key controller:/tmp/
scp /tmp/munge.key node01:/tmp/
Move the Munge Key to Its Permanent Location (on controller and node01)
On the admin node the key is already in /etc/munge; just make sure it has the same ownership and permissions set below.
sudo mv /tmp/munge.key /etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
Enable and Start Munge on All Nodes
sudo systemctl enable --now munge
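To verify that authentication works across nodes, create a credential on one node and decode it on another; both commands ship with the munge package:
munge -n | unmunge                  # local check
munge -n | ssh controller unmunge   # cross-node check, run from admin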
Step 7: Install and Configure Slurm
Install Slurm on All Nodes
sudo apt install slurm-wlm
Create Slurm Directories on Controller
sudo mkdir -p /var/spool/slurmctld
sudo chown slurm:slurm /var/spool/slurmctld
sudo chmod 755 /var/spool/slurmctld
Add the slurm.conf File (on the Controller, at /etc/slurm/slurm.conf)
ClusterName=hpc-cluster
SlurmctldHost=controller
SlurmUser=slurm
SlurmdUser=slurm
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
NodeName=node01 CPUs=2 State=UNKNOWN
PartitionName=debug Nodes=node01 Default=YES MaxTime=INFINITE State=UP
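If you are unsure what to put on the NodeName line, slurmd can report the hardware it detects; run the following on node01 and copy the relevant fields into slurm.conf:
slurmd -C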
Start the Service on Controller
sudo systemctl enable --now slurmctld
On the Compute Node
sudo mkdir -p /var/spool/slurmd
sudo chown slurm:slurm /var/spool/slurmd
sudo chmod 755 /var/spool/slurmd
Copy slurm.conf from the Controller
scp controller:/etc/slurm/slurm.conf ~/slurm.conf
sudo mv ~/slurm.conf /etc/slurm/slurm.conf
Start slurmd
sudo systemctl enable --now slurmd
Test Slurm
sinfo
srun hostname
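Beyond srun, submitting a small batch job confirms scheduling end to end. The following is a minimal sketch using a hypothetical script name test_job.sh; note that without a shared working directory, the output file is written on the compute node rather than where you submitted the job.
cat > test_job.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=smoke-test
#SBATCH --output=smoke-test.out
#SBATCH --ntasks=1
hostname
EOF
sbatch test_job.sh
squeue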
Step 8 (Optional): Add SaltStack for Configuration Management
Download and Install Salt Bootstrap
On the Master (Admin Node):
curl -o bootstrap-salt.sh -L https://github.com/saltstack/salt-bootstrap/releases/latest/download/bootstrap-salt.sh
sudo sh bootstrap-salt.sh -P -M stable 3006.1
On the Minions (Controller and Node01):
curl -o bootstrap-salt.sh -L https://github.com/saltstack/salt-bootstrap/releases/latest/download/bootstrap-salt.sh
sudo sh bootstrap-salt.sh -P stable 3006.1
Enable Salt on Master / Minions
sudo systemctl enable --now salt-master #master
sudo systemctl enable --now salt-minion #minion
Configure the Salt minion
On each minion, create the file /etc/salt/minion.d/master.conf with the following content:
master: admin
Then restart the minion:
sudo systemctl restart salt-minion
Accept the Keys
On the Salt Master:
sudo salt-key -L # list pending keys (you'll see each minion by its hostname)
sudo salt-key -A # accept all keys
Test Connection
sudo salt '*' test.ping
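Once test.ping succeeds, Salt can push configuration to every node at once. A minimal sketch of a state that keeps munge installed and running on all minions, using a hypothetical state file at /srv/salt/munge.sls (the default file_roots on the master):
# /srv/salt/munge.sls
munge:
  pkg.installed: []
  service.running:
    - enable: True
    - require:
      - pkg: munge
Apply it from the master:
sudo salt '*' state.apply munge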