Example 4: scale to a high-performance computing cluster
When we have a sufficiently large dataset, we may want to fit the model on a high-performance computing (HPC) cluster to speed up the process. CPM does not have built-in support for HPC clusters, but it is possible to scale the fitting by running the same script on every node of the cluster.
Here, we will explore how to do this using the socket and os modules from Python's standard library.
This is a crude approach, but it will suffice for most day-to-day purposes.
Essentially, we divide the data into chunks and fit each chunk on a different node of the cluster.
This requires an executable script that can be run on each node and submitted to a SLURM scheduler.
An example script, which we will call my_job.py, is provided below:
#!/usr/bin/env python3.12
if __name__ == "__main__":
    import os
    import socket

    import numpy as np
    import pandas as pd

    data = pd.read_csv('bandit_small.csv')

    ## subset data into chunks
    ppt_to_chunks = data.ppt.unique()
    # Get the number of nodes available
    num_nodes = int(os.getenv("SLURM_JOB_NUM_NODES", 1))
    # In this demo we run on a single node, so SLURM reports one node.
    # We override it here to illustrate the chunking - remove the following
    # line when running on multiple nodes.
    num_nodes = 4
    chunks = np.array_split(ppt_to_chunks, num_nodes)
    # Get the hostname of the node
    node_name = socket.gethostname()
    # Get the SLURM task ID (defaults to 0 outside of SLURM)
    task_id = int(os.getenv("SLURM_PROCID", 0))
    # Some useful information to print
    # print(f"Node: {node_name}, Task ID: {task_id}")
    # Get the chunk of data that this task will work on
    ppt_to_nodes = data.ppt.isin(chunks[task_id])
    print(f"Shape of the current data: {data[ppt_to_nodes].shape}, which is {(data[ppt_to_nodes].shape[0] / data.shape[0]) * 100}% of the complete data allocated to a single node.")
    # Below you can do your job with the data
    # do_something(data[ppt_to_nodes])
Shape of the current data: (576, 9), which is 25.0% of the complete data allocated to a single node.
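The do_something call above is only a placeholder. As a rough illustration, the sketch below shows what such a per-chunk routine might look like; the summary it computes and the fit_task_*.csv naming scheme are hypothetical, and the body should be replaced with your actual CPM fitting code. It writes one file per task into a results/ directory so that the submission script shown next can copy the output back.

import os
import pandas as pd

def do_something(chunk: pd.DataFrame) -> None:
    # Hypothetical per-chunk routine - replace the body with your model fitting.
    task_id = int(os.getenv("SLURM_PROCID", 0))
    os.makedirs("results", exist_ok=True)
    # Placeholder "result": the number of trials per participant in this chunk.
    summary = chunk.groupby("ppt").size().rename("n_trials").reset_index()
    # One output file per SLURM task, so parallel tasks never overwrite each other.
    summary.to_csv(f"results/fit_task_{task_id}.csv", index=False)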
Once the script is ready, we will usually have to write a bash script that will submit the job to the cluster. See an example below:
#!/bin/bash -l
#### Define some basic SLURM properties for this job - there can be many more!
#SBATCH --job-name=my_simulation # Replace with your job name
#SBATCH --nodes=4 # Replace with the number of nodes you need
#SBATCH --partition compute
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=64 # Replace with the number of cores you need
#SBATCH --time=48:00:00 # Replace with the time you need, the format is hours:minutes:seconds
#### This block shows how you would create a working directory to store your job data:
# Define and create a unique scratch directory for this job:
SCRATCH_DIRECTORY=/scratch/${USER}/${SLURM_JOBID} # Adjust the base path to your cluster's scratch filesystem
mkdir -p ${SCRATCH_DIRECTORY}/results
if [ $? -ne 0 ]; then
echo "Failed to create scratch directory"
exit 1
fi
cd ${SCRATCH_DIRECTORY}
if [ $? -ne 0 ]; then
echo "Failed to change directory to scratch directory"
exit 1
fi
# Note: ${SLURM_SUBMIT_DIR} contains the path where you started the job
# You can copy everything you need to the scratch directory
# ${SLURM_SUBMIT_DIR} points to the path where this script was submitted from
cp -r ${SLURM_SUBMIT_DIR}/* ${SCRATCH_DIRECTORY}
if [ $? -ne 0 ]; then
echo "Failed to copy files to scratch directory"
exit 1
fi
#### Do the actual work:
# Make sure we have Python available
module load python
if [ $? -ne 0 ]; then
echo "Failed to load Python module"
exit 1
fi
# Debugging output to verify paths and environment variables
echo "Scratch directory: ${SCRATCH_DIRECTORY}"
echo "Submit directory: ${SLURM_SUBMIT_DIR}"
# Run the Python job on the allocated nodes
srun python my_job.py
if [ $? -ne 0 ]; then
echo "Singularity execution failed"
exit 1
fi
# The hostname command below prints the compute node's name into the output, making it easier to understand what's going on
echo "Compute node: $(hostname)"
# After the job is done we copy our output back to $SLURM_SUBMIT_DIR
cp -r ${SCRATCH_DIRECTORY}/results ${SLURM_SUBMIT_DIR}
if [ $? -ne 0 ]; then
echo "Failed to copy results back to submit directory"
exit 1
fi
#### This is how you would clean up the working directory (after copying any important files back!):
cd ${SLURM_SUBMIT_DIR}
rm -rf ${SCRATCH_DIRECTORY}
if [ $? -ne 0 ]; then
echo "Failed to clean up scratch directory"
exit 1
fi
exit 0
With --nodes=4 and --ntasks-per-node=1, srun launches four copies of my_job.py, and each copy receives a distinct SLURM_PROCID (0-3), which is exactly the index the Python script uses to select its chunk. Saving the submission script as my_job.sh, we can then submit it to the cluster using the sbatch command:
sbatch my_job.sh
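Once all tasks have finished and the results/ directory has been copied back to the submit directory, the per-task output files can be combined into a single dataframe. A minimal sketch, assuming the hypothetical fit_task_*.csv naming used in the do_something example above:

import glob
import pandas as pd

# Collect the per-task output files copied back from the scratch directory.
files = sorted(glob.glob("results/fit_task_*.csv"))
combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
combined.to_csv("results/combined_fits.csv", index=False)
print(f"Combined {len(files)} task outputs into {combined.shape[0]} rows.")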
That is all; we have now scaled our model fitting to a high-performance computing cluster. Happy coding!