Example 4: Scale to a hypercomputing cluster

CPM does not have built-in support for hypercomputing clusters, but it is possible to scale model fitting to a cluster by running the same script on each node. When the dataset is sufficiently large, fitting the model on a hypercomputing cluster can speed up the process considerably. Here, we will explore how to do this using the socket and os modules, both built into Python.

This is a crude approach, but it will suffice for most day-to-day purposes. What we essentially do is divide the data into chunks and fit each chunk on a different node of the cluster. This requires an executable script that can run on each node and be submitted to a SLURM cluster. An example script is provided below; let's call it my_job.py:

In [23]:
#!/usr/bin/env python3.12
if __name__ == "__main__":
    import numpy as np
    import socket
    import os
    import pandas as pd

    data = pd.read_csv('bandit_small.csv')

    ## subset data into chunks
    ppt_to_chunks = data.ppt.unique()

    # Get the number of nodes available
    num_nodes = int(os.getenv("SLURM_JOB_NUM_NODES", 1))

    # SLURM_JOB_NUM_NODES will be 1 here because this notebook runs outside SLURM
    # Remove the following line when running on an actual multi-node job
    num_nodes = 4

    chunks = np.array_split(ppt_to_chunks, num_nodes)

    # Get the hostname of the node
    node_name = socket.gethostname()

    # Get the SLURM task ID
    task_id = int(os.getenv("SLURM_PROCID", 0))

    # Some useful information to print
    # print(f"Node: {node_name}, Task ID: {task_id}")

    # Get the chunk of data that this node will work on
    ppt_to_nodes = data.ppt.isin(chunks[task_id])

    print(f"Shape of the current data: {data[ppt_to_nodes].shape}, which is {(data[ppt_to_nodes].shape[0] / data.shape[0])*100}% of the complete data allocated to a single node.")

    # Below you can do your job with the data
    # do_something(data[ppt_to_nodes])
Shape of the current data: (576, 9), which is 25.0% of the complete data allocated to a single node.
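
In practice, each task should also write its output to a unique file so that the nodes do not overwrite each other's results. Below is a hedged sketch of what could go at the end of my_job.py; it reuses data, ppt_to_nodes, task_id, and the os import from the script above, and both the do_something() placeholder and the results_task_<id>.csv naming are illustrative assumptions rather than anything CPM provides.

    # Replace do_something() with your actual fitting routine; here we assume
    # it returns a pandas DataFrame (this is a hypothetical placeholder)
    # results = do_something(data[ppt_to_nodes])
    results = data[ppt_to_nodes]  # stand-in so the sketch runs end to end

    # Write this task's output into the results/ directory created by the
    # submission script, using the SLURM task ID to keep file names unique
    os.makedirs("results", exist_ok=True)
    results.to_csv(os.path.join("results", f"results_task_{task_id}.csv"), index=False)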

Once the script is ready, we usually write a bash script that submits the job to the cluster. See an example below:

In [ ]:
#!/bin/bash -l

#### Define some basic SLURM properties for this job - there can be many more!
#SBATCH --job-name=my_simulation # Replace with your job name
#SBATCH --nodes=4 # Replace with the number of nodes you need
#SBATCH --partition=compute
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=64 # Replace with the number of cores you need
#SBATCH --time=48:00:00 # Replace with the time you need; the format is hours:minutes:seconds

#### This block shows how you would create a working directory to store your job data:
# Define and create a unique scratch directory for this job
# (adjust the base path to your cluster's scratch filesystem):
SCRATCH_DIRECTORY=/scratch/${USER}/${SLURM_JOBID}
mkdir -p ${SCRATCH_DIRECTORY}/results
if [ $? -ne 0 ]; then
    echo "Failed to create scratch directory"
    exit 1
fi

cd ${SCRATCH_DIRECTORY}
if [ $? -ne 0 ]; then
    echo "Failed to change directory to scratch directory"
    exit 1
fi

# Copy everything you need into the scratch directory
# ${SLURM_SUBMIT_DIR} contains the path where this script was submitted from
cp -r ${SLURM_SUBMIT_DIR}/* ${SCRATCH_DIRECTORY}
if [ $? -ne 0 ]; then
    echo "Failed to copy files to scratch directory"
    exit 1
fi

#### Do the actual work:
# Make sure we have Python available
module load python
if [ $? -ne 0 ]; then
    echo "Failed to load Python module"
    exit 1
fi

# Debugging output to verify paths and environment variables
echo "Scratch directory: ${SCRATCH_DIRECTORY}"
echo "Submit directory: ${SLURM_SUBMIT_DIR}"

# Run the Python script; srun launches one task per node, as requested above
srun python my_job.py
if [ $? -ne 0 ]; then
    echo "Singularity execution failed"
    exit 1
fi

# Print the compute node's name into the output, making it easier to see where the job ran
echo "Compute node: $(hostname)"

# After the job is done we copy our output back to $SLURM_SUBMIT_DIR
cp -r ${SCRATCH_DIRECTORY}/results ${SLURM_SUBMIT_DIR}
if [ $? -ne 0 ]; then
    echo "Failed to copy results back to submit directory"
    exit 1
fi

#### This is how you would clean up the working directory (after copying any important files back!):
cd ${SLURM_SUBMIT_DIR}
rm -rf ${SCRATCH_DIRECTORY}
if [ $? -ne 0 ]; then
    echo "Failed to clean up scratch directory"
    exit 1
fi

exit 0

This can then be submitted to the cluster using the sbatch command (assuming the bash script above is saved as my_job.sh):

sbatch my_job.sh
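
Once all tasks have finished and the results directory has been copied back to the submit directory, the per-task files can be combined into a single data frame. This is a minimal sketch assuming each task wrote a results_task_<id>.csv file as in the hypothetical snippet above; adjust the glob pattern to whatever your own script actually produces.

import glob
import pandas as pd

# Collect every per-task results file copied back from the cluster
# (the results_task_*.csv naming follows the sketch above and is an assumption)
files = sorted(glob.glob("results/results_task_*.csv"))

# Concatenate them into a single DataFrame covering all participants
combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
combined.to_csv("results_combined.csv", index=False)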

That is all: we have now scaled our model fitting to a hypercomputing cluster. Happy coding!
