Customizing the Slurm Application Environment
Edit the Slurm Environment JSON for Your Purposes:
Copy `default-slurm-env.json` to something convenient like `env0.json`:

```
cp default-slurm-env.json env0.json
```
Note: The line numbers listed below reference the example file above. Once changes are made on the system, the line numbers may shift.
Line 2: `"EnvName"` is set to `slurm` by default, but you can specify something unique if needed. NOTE: Currently, `-` characters are not supported in values for `EnvName`.
Lines 5-20 can be modified for a single pool of identical compute resources, or they can be duplicated and then modified for each “hardware” configuration or “pool” you choose. When duplicating, be sure to add a comma after the closing brace on line 17 of each pool block except the last one (the final pool declaration).
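As an illustration only, a duplicated two-pool layout might look like the sketch below. The surrounding `"Pools"` key, the nesting, and every value shown are assumptions for this example, not taken from the actual `default-slurm-env.json`; only the field names come from this guide. Note the comma after the first pool's closing brace and none after the last:

```
"Pools": [
    {
        "PoolName": "xspot-vm-small",
        "PoolSize": 10,
        "ProfileName": "az1",
        "CPUs": 4,
        "ImageName": "slurm-compute-ami",
        "MaxMemory": 16384,
        "MinMemory": 0,
        "UserData": "<Base64EncodedString>",
        "VolumeSize": 0
    },
    {
        "PoolName": "xspot-vm-large",
        "PoolSize": 4,
        "ProfileName": "az1",
        "CPUs": 16,
        "ImageName": "slurm-compute-ami",
        "MaxMemory": 65536,
        "MinMemory": 0,
        "UserData": "<Base64EncodedString>",
        "VolumeSize": 0
    }
]
```

Keeping a common trunk in each `PoolName` (here `xspot-vm-`) follows the recommendation below.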
- `PoolName`: This will be the apparent hostname of the compute resources provided for Slurm. It is recommended that all pools share a common trunk or base in each `PoolName`.
- `PoolSize`: This is the maximum number of these compute resources.
- `ProfileName`: This is the default profile name, `az1`. If this is changed, you will need to carry the change forward.
- `CPUs`: This is the targeted CPU-core limit for this "hardware" configuration or pool.
- `ImageName`: This is tied to the AMI that will be used for your compute resources. This name will be used in subsequent steps.
- `MaxMemory`: This is the targeted memory limit for this "hardware" configuration or pool.
- `MinMemory`: Reserved for future use; can be ignored currently.
- `UserData`: This string is a base64-encoded version of `user_data.sh`. To generate it:

  ```
  cat user_data.sh | base64 -w 0
  ```

  To decode it:

  ```
  echo "<LongBase64EncodedString>" | base64 -d
  ```

  It does not need to be perfectly fine-tuned at this stage; it will be refined and corrected later.
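The encode/decode pair above can be sanity-checked as a round trip. The minimal `user_data.sh` content here is just a stand-in for illustration, not the real script:

```shell
# Create a minimal stand-in user_data.sh (placeholder content only).
cat > user_data.sh <<'EOF'
#!/bin/bash
hostname XSPOT_NODENAME
EOF

# Encode without line wrapping (-w 0 is a GNU coreutils option),
# producing a single-line string suitable for the UserData field.
ENCODED=$(base64 -w 0 < user_data.sh)

# Decoding the string reproduces the original script byte-for-byte.
echo "$ENCODED" | base64 -d
```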
You may format `user_data.sh` in the usual ways. Simple Slurm example:
```
#!/bin/bash
set -x
#export SLURM_BIN_DIR=/opt/slurm/bin
export SLURM_BIN_DIR=/usr/bin
hostname XSPOT_NODENAME
${SLURM_BIN_DIR}/scontrol update nodename=XSPOT_NODENAME nodeaddr=`hostname -I | cut -d" " -f1`
systemctl start slurmd
```
APC example:
```
#!/bin/bash
set -x
APCHEAD=XXX.XX.X.XXX #enter APC Head Node IP Address
######
hostname XSPOT_NODENAME
#For troubleshooting
#echo root:TroubleShooting |chpasswd
#sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/g' /etc/sshd/sshd_config
#sed -i 's/UsePAM yes/UsePAM no/g' /etc/sshd/sshd_config
sed -i 's/#PermitRootLogin yes/PermitRootLogin yes/g' /etc/ssh/sshd_config
echo 'ssh-rsa 0101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101010 root@APCHEAD' >> /root/.ssh/authorized_keys
systemctl restart sshd
mkdir -p /home /opt/parallelcluster/shared /opt/intel /opt/slurm
for i in /home /opt/parallelcluster/shared /opt/intel /opt/slurm ; do
  echo Mounting ${APCHEAD}:${i} ${i}
  mount -t nfs ${APCHEAD}:${i} ${i}
  echo Mounting ${APCHEAD}:${i} ${i} : SUCCESS.
done
mkdir /exoniv
echo 'fs-0553a8e956ccff4da.efs.us-east-1.amazonaws.com:/ /exoniv nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=30,retrans=2,noresvport,_netdev 0 0' >> /etc/fstab
mount -a
#add local users, real users, and/or testing users
groupadd -g 899 exo
useradd -u 1001 -g 899 krs
groupadd -g 401 slurm
groupadd -g 402 munge
useradd -g 401 -u 401 slurm
useradd -g 402 -u 402 munge
rpm -ivh /opt/parallelcluster/shared/munge/x86_64/munge-0.5.14-1.el7.x86_64.rpm
cp -p /opt/parallelcluster/shared/munge/munge.key /etc/munge/
chown munge.munge /etc/munge /var/log/munge
mkdir -p /var/spool/slurmd
chown slurm.slurm /var/spool/slurmd
sleep 5
systemctl start munge
if [[ $? -ne 0 ]]; then
  sleep 10
  systemctl start munge
fi
SLURM_BIN_PATH=/opt/slurm/bin
SLURM_SBIN_PATH=/opt/slurm/sbin
SLURM_CONF_DIR=/opt/slurm/etc
${SLURM_BIN_PATH}/scontrol update nodename=XSPOT_NODENAME nodeaddr=`hostname -I | cut -d" " -f1`
#systemctl start slurmd
${SLURM_SBIN_PATH}/slurmd -f ${SLURM_CONF_DIR}/slurm.conf -N XSPOT_NODENAME
```
- `VolumeSize`: Reserved for future use; can be ignored currently.
Lines 24-26 should be modified for your Slurm environment and according to your preference for the partition name.
- `BinPath`: This is where `scontrol`, `squeue`, and other Slurm binaries exist.
- `ConfPath`: This is where `slurm.conf` resides.
- `PartitionName`: This is for naming the new partition.
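For illustration, those three lines might look like the fragment below. The wrapper key and values are assumptions (the paths echo the APC example above, and the partition name is invented); adjust them to your own Slurm installation:

```
"SlurmConfig": {
    "BinPath": "/opt/slurm/bin",
    "ConfPath": "/opt/slurm/etc",
    "PartitionName": "xspot"
}
```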
All other fields/lines in this asset can be ignored.