Example 4: scale to a high-performance computing cluster
When we have a sufficiently large dataset, we may want to fit the model on a high-performance computing (HPC) cluster to speed up the process. CPM does not have built-in support for HPC clusters, but it is possible to scale the fitting by running the same script on every node of the cluster.
Here, we will explore how to do this using the socket and os modules from Python's standard library.
This is a crude approach, but it will suffice for most day-to-day purposes.
Essentially, we divide the data into chunks and fit each chunk on a different node of the cluster.
This requires an executable script that can be run on each node and submitted to a SLURM scheduler.
An example script, which we will call my_job.py, is provided below:
#!/usr/bin/env python3.12
if __name__ == "__main__":
    import os
    import socket

    import numpy as np
    import pandas as pd

    data = pd.read_csv('bandit_small.csv')

    ## subset data into chunks
    ppt_to_chunks = data.ppt.unique()
    # Get the number of nodes available
    num_nodes = int(os.getenv("SLURM_JOB_NUM_NODES", 1))
    # In this demo we run on a single node, so SLURM reports one node.
    # We override it here to illustrate the chunking - remove the following
    # line when running on multiple nodes.
    num_nodes = 4
    chunks = np.array_split(ppt_to_chunks, num_nodes)
    # Get the hostname of the node
    node_name = socket.gethostname()
    # Get the SLURM task ID (defaults to 0 outside of SLURM)
    task_id = int(os.getenv("SLURM_PROCID", 0))
    # Some useful information to print
    # print(f"Node: {node_name}, Task ID: {task_id}")
    # Get the chunk of data that this task will work on
    ppt_to_nodes = data.ppt.isin(chunks[task_id])
    print(f"Shape of the current data: {data[ppt_to_nodes].shape}, which is {(data[ppt_to_nodes].shape[0] / data.shape[0]) * 100}% of the complete data allocated to a single node.")
    # Below you can do your job with the data
    # do_something(data[ppt_to_nodes])
Shape of the current data: (576, 9), which is 25.0% of the complete data allocated to a single node.
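The do_something call above is only a placeholder. As a rough illustration, the sketch below shows what such a per-chunk routine might look like; the summary it computes and the fit_task_*.csv naming scheme are hypothetical, and the body should be replaced with your actual CPM fitting code. It writes one file per task into a results/ directory so that the submission script shown next can copy the output back.

import os
import pandas as pd

def do_something(chunk: pd.DataFrame) -> None:
    # Hypothetical per-chunk routine - replace the body with your model fitting.
    task_id = int(os.getenv("SLURM_PROCID", 0))
    os.makedirs("results", exist_ok=True)
    # Placeholder "result": the number of trials per participant in this chunk.
    summary = chunk.groupby("ppt").size().rename("n_trials").reset_index()
    # One output file per SLURM task, so parallel tasks never overwrite each other.
    summary.to_csv(f"results/fit_task_{task_id}.csv", index=False)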
Once the script is ready, we will usually have to write a bash script that will submit the job to the cluster. See an example below:
#!/bin/bash -l
#### Define some basic SLURM properties for this job - there can be many more!
#SBATCH --job-name=my_simulation # Replace with your job name
#SBATCH --nodes=4 # Replace with the number of nodes you need
#SBATCH --partition compute
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=64 # Replace with the number of cores you need
#SBATCH --time=48:00:00 # Replace with the time you need, the format is hours:minutes:seconds
#### This block shows how you would create a working directory to store your job data:
# Define and create a unique scratch directory for this job:
SCRATCH_DIRECTORY=/scratch/${USER}/${SLURM_JOBID} # Adjust the base path to your cluster's scratch filesystem
mkdir -p ${SCRATCH_DIRECTORY}/results
if [ $? -ne 0 ]; then
echo "Failed to create scratch directory"
exit 1
fi
cd ${SCRATCH_DIRECTORY}
if [ $? -ne 0 ]; then
echo "Failed to change directory to scratch directory"
exit 1
fi
# Note: ${SLURM_SUBMIT_DIR} contains the path where you started the job
# You can copy everything you need to the scratch directory
# ${SLURM_SUBMIT_DIR} points to the path where this script was submitted from
cp -r ${SLURM_SUBMIT_DIR}/* ${SCRATCH_DIRECTORY}
if [ $? -ne 0 ]; then
echo "Failed to copy files to scratch directory"
exit 1
fi
#### Do the actual work:
# Make sure we have Python available
module load python
if [ $? -ne 0 ]; then
echo "Failed to load Python module"
exit 1
fi
# Debugging output to verify paths and environment variables
echo "Scratch directory: ${SCRATCH_DIRECTORY}"
echo "Submit directory: ${SLURM_SUBMIT_DIR}"
# Run the Python job on the allocated nodes
srun python my_job.py
if [ $? -ne 0 ]; then
echo "Singularity execution failed"
exit 1
fi
# The hostname command below prints the compute node's name into the output, making it easier to understand what's going on
echo "Compute node: $(hostname)"
# After the job is done we copy our output back to $SLURM_SUBMIT_DIR
cp -r ${SCRATCH_DIRECTORY}/results ${SLURM_SUBMIT_DIR}
if [ $? -ne 0 ]; then
echo "Failed to copy results back to submit directory"
exit 1
fi
#### This is how you would clean up the working directory (after copying any important files back!):
cd ${SLURM_SUBMIT_DIR}
rm -rf ${SCRATCH_DIRECTORY}
if [ $? -ne 0 ]; then
echo "Failed to clean up scratch directory"
exit 1
fi
exit 0
With --nodes=4 and --ntasks-per-node=1, srun launches four copies of my_job.py, and each copy receives a distinct SLURM_PROCID (0-3), which is exactly the index the Python script uses to select its chunk. Saving the submission script as my_job.sh, we can then submit it to the cluster using the sbatch command:
sbatch my_job.sh
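Once all tasks have finished and the results/ directory has been copied back to the submit directory, the per-task output files can be combined into a single dataframe. A minimal sketch, assuming the hypothetical fit_task_*.csv naming used in the do_something example above:

import glob
import pandas as pd

# Collect the per-task output files copied back from the scratch directory.
files = sorted(glob.glob("results/fit_task_*.csv"))
combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
combined.to_csv("results/combined_fits.csv", index=False)
print(f"Combined {len(files)} task outputs into {combined.shape[0]} rows.")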
That is all; we have now scaled our model fitting to a high-performance computing cluster. Happy coding!