Adding or Modifying Pools: Slurm
It is common to refine or otherwise modify configurations over time. A CLI tool is planned that will remove the need for the following manual steps; until it is available, these are the steps required for reconfiguration after the initial integration has been pushed into production.
Navigate to the exostellar directory where the configuration assets reside:
    cd ${SLURM_CONF_DIR}/exostellar
Set up a timestamp folder in case there is a need to roll back:
    PREVIOUS_DIR=$( date +%Y-%m-%d_%H-%M-%S )
    mkdir ${PREVIOUS_DIR}
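If a rollback is ever needed later, the saved copies can be restored from that timestamp directory. A minimal sketch follows; note that restoring the files alone does not undo anything already pushed to the MGMT_SERVER:
    # Restore the previously saved assets into the exostellar directory.
    cp -a ${PREVIOUS_DIR}/. ${SLURM_CONF_DIR}/exostellar/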
Place the contents of the exostellar directory in the timestamp directory:
    mv * ${PREVIOUS_DIR}
Make a new json folder and copy the existing env0.json and profile0.json into it:
    mkdir json
    cp ${PREVIOUS_DIR}/json/env0.json ${PREVIOUS_DIR}/json/profile0.json json/
    cd json
    mv env0.json env1.json
    mv profile0.json profile1.json
Edit env1.json as needed; for example, add more pools if you need more CPU-core or memory options available in the partition, or increase the node count in existing pools. See Environment Configuration Information for reference.
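As an illustration only, a change such as increasing a pool's node count can also be scripted with jq. The key path used below (.pools[0].pool_size) is a hypothetical placeholder, not the documented env.json schema; check your env1.json for the actual field names before scripting anything like this:
    # Hypothetical sketch: set the first pool's node count to 20.
    # The path .pools[0].pool_size is a placeholder; use the key names found in your env1.json.
    jq '.pools[0].pool_size = 20' env1.json > env1.json.tmp && mv env1.json.tmp env1.json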
Likely, profile1.json will not need any modification. See Profile Configuration Information for reference.
Validate the JSON asset with jq:
    jq . env1.json
If jq can read the file, it prints well-formatted JSON, indicating no errors. An error message means the JSON is not valid. When the JSON is valid, the file can be pushed to the MGMT_SERVER:
    curl -d "@env1.json" -H 'Content-Type: application/json' -X PUT http://${MGMT_SERVER_IP}:5000/v1/env
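Optionally, the HTTP status code returned by the PUT can be captured to confirm the update was accepted; the exact success code is not documented here, but a 2xx response generally indicates the request went through:
    # Print only the HTTP status code returned by the MGMT_SERVER.
    curl -s -o /dev/null -w '%{http_code}\n' -d "@env1.json" \
         -H 'Content-Type: application/json' -X PUT http://${MGMT_SERVER_IP}:5000/v1/env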
If the profile was changed, validate it with the same quick jq test:
    jq . profile1.json
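For a scripted check, jq's exit status can be used directly instead of reading the formatted output; a small sketch:
    # jq exits 0 only if the file parses as valid JSON; 'empty' suppresses the output.
    jq empty profile1.json && echo "profile1.json is valid JSON"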
Push the changes live:
    curl -d "@profile1.json" -H 'Content-Type: application/json' -X PUT http://${MGMT_SERVER_IP}:5000/v1/profile
Grab the assets from the MGMT_SERVER:
    curl -X GET http://${MGMT_SERVER_IP}:5000/v1/xcompute/download/slurm -o slurm.tgz
If the EnvName was changed (above in Edit the Slurm Environment JSON for Your Purposes - Step 2), the following command can be used with your CustomEnvironmentName:
    curl -X GET http://${MGMT_SERVER_IP}:5000/v1/xcompute/download/slurm?envName=CustomEnvironmentName -o slurm.tgz
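Quoting the URL is a safe habit here, since the query string contains characters such as ? that some shells may otherwise try to interpret:
    # Same request, with the URL quoted to avoid any shell globbing on the query string.
    curl -X GET "http://${MGMT_SERVER_IP}:5000/v1/xcompute/download/slurm?envName=CustomEnvironmentName" -o slurm.tgz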
Unpack them into the exostellar folder:
    tar xf slurm.tgz -C ../
    cd ..
    mv assets/* .
    rmdir assets
Edit resume_xspot.sh and add the sed command snippet (| sed "s/XSPOT_NODENAME/$host/g") for every pool:
    user_data=$(cat /opt/slurm/etc/xvm16-_user_data | base64 -w 0)
becomes:
    user_data=$(cat /opt/slurm/etc/exostellar/xvm16-_user_data | sed "s/XSPOT_NODENAME/$host/g" | base64 -w 0)
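A quick way to confirm that every pool's user_data line picked up the snippet is to grep for the substitution after editing:
    # Each pool's user_data line in resume_xspot.sh should now include the sed substitution.
    grep -n 'XSPOT_NODENAME' resume_xspot.sh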
Introducing new nodes into a Slurm cluster requires a restart of the Slurm control daemon:
    systemctl restart slurmctld
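After the restart, the partition and its nodes should be visible to Slurm; for example, using the same NewPartitionName placeholder as the job submission step below:
    # List the partition to confirm the new or updated nodes are present.
    sinfo -p NewPartitionName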
Integration steps are complete, and a job submission to the new partition is the final validation. As a user, navigate to a valid job submission directory and launch a job as normal, but be sure to specify the new partition:
    sbatch -p NewPartitionName < job-script.sh
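The job's progress can then be watched with squeue; once it starts, it should be running on one of the nodes backed by the new or modified pool:
    # Monitor the submitted job; replace $USER if submitting as a different account.
    squeue -u $USER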