This text file describes the advanced features of SLURM for VE. It currently supports the following advanced features:
 - VEs Health Check
 - VE Accounting

(1) VEs Health Check

(1-1) Overview
You can run a health check to detect VE failures on compute nodes where VEs are installed. If a failure is detected, the compute node and the job are handled according to the settings. The following two types of health check are supported:
 1) Health Check at Job Start and End
 2) Periodic Health Checks

(1-2) Health Check at Job Start and End
If VEs are requested as a gres, Prolog and Epilog check the allocated VEs for failures at job start (job allocation) and at job end. If a failure is detected on one or more VEs, one of the following two action modes takes effect. Even if a failure is detected on one VE, the health check is not interrupted; all allocated VEs are checked. Jobs that do not request VEs do not trigger a VE health check.

Action Mode 1 suits operations where you want to keep using the healthy VEs and CPU cores of a compute node even if some VEs fail. Action Mode 2 suits operations where the cause of a failure is investigated and repaired immediately when it occurs. Select the action mode according to your operational policy. The default is Action Mode 1.

 - Action Mode 1 (Compute Node Operation Continuation Mode)
   Operation continues without taking any special action on the compute node. If a failure is detected by the health check at job start, the job that triggered the health check is requeued. In the health check at job end, however, the job that triggered the health check has already ended, so it is not requeued even if a failure is detected. If the option to notify the user when a job is requeued (--mail-type) was specified at job submission, a notification for the requeued job is sent according to the user notification settings.

 - Action Mode 2 (Compute Node Operation Stop Mode)
   The entire compute node is treated as a failed node, set to the DOWN state, and removed from operation. All jobs running on that node are requeued. User notification is the same as in Action Mode 1.

(1-2-1) Health check settings
The VEs health check is performed by a shell script. The sample script is installed under /etc/slurm/ when the slurm-for_ve-22.05.2-1.el7.x86_64.rpm package is installed. The sample script name is ve_healthchk_for_job.sh.example. Follow the steps below with root privileges:
 1) Rename /etc/slurm/ve_healthchk_for_job.sh.example to /etc/slurm/ve_healthchk_for_job.sh.
 2) Change the file permissions to 755 with the chmod command.
    Example: chmod 755 ve_healthchk_for_job.sh
 3) Set the full path of the health check script in the Prolog and Epilog options in /etc/slurm/slurm.conf.
    Example: Prolog=/etc/slurm/ve_healthchk_for_job.sh
             Epilog=/etc/slurm/ve_healthchk_for_job.sh
 4) Set Alloc in the PrologFlags option in /etc/slurm/slurm.conf so that the health check script runs when the job is allocated.
 5) Open the health check script with an editor such as vi and set the action mode that matches your operational policy. The default is Action Mode 1.
    For Action Mode 1: TROUBLESHOOTING_MODE=1
    For Action Mode 2: TROUBLESHOOTING_MODE=2
 6) Run the "scontrol reconfigure" command to apply the settings.
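Taken together, steps 3) and 4) correspond to the following lines in /etc/slurm/slurm.conf (a minimal sketch; if PrologFlags already lists other flags on your system, append Alloc to that comma-separated list instead of replacing it):
---
Prolog=/etc/slurm/ve_healthchk_for_job.sh
Epilog=/etc/slurm/ve_healthchk_for_job.sh
PrologFlags=Alloc
---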
(1-2-2) Return value of health check shell script
The health check shell script always exits with exit 0, regardless of whether a VE has failed. If a VE failure is detected, the job is requeued and the state of the compute node is updated inside the health check script according to the action mode setting, but Prolog and Epilog themselves always succeed.

(1-2-3) Actions for compute nodes
 - Action Mode 1
   Operation continues without taking any action on the failed compute node. If left unattended, however, a faulty VE may be assigned to a job. Therefore, when a failure is detected, delete the VE from /etc/slurm/slurm.conf and /etc/slurm/gres.conf.
   Example: when 8 VEs are installed on compute node vhost1 and VE0 fails
   Reduce the number of VEs in /etc/slurm/slurm.conf (Gres=ve:10b:8 -> Gres=ve:10b:7):
    Before: NodeName=vhost1 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128828 Gres=ve:10b:8,hca:2 State=UNKNOWN
    After:  NodeName=vhost1 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128828 Gres=ve:10b:7,hca:2 State=UNKNOWN
   Remove the VE from /etc/slurm/gres.conf (File=/dev/veslot[0-7] -> File=/dev/veslot[1-7]):
    Before: NodeName=vhost1 Name=ve Type=10b File=/dev/veslot[0-7]
    After:  NodeName=vhost1 Name=ve Type=10b File=/dev/veslot[1-7]
 - Action Mode 2
   The failed compute node is changed to the DOWN state. You can check the node state with the "scontrol show node" command. The following message is displayed in the "Reason" column:
    "VEs health check fail for job at <timing>"
    Description: <timing> is prolog_slurmd or epilog_slurmd.

(1-2-4) Actions for jobs
 - Action Mode 1
   For a health check at job start (Prolog), the job that triggered the health check is requeued. The job is assigned to another healthy compute node.
 - Action Mode 2
   All jobs running on the compute node are requeued. The jobs are assigned to other healthy compute nodes.

(1-2-5) User notification
If you want users to be notified when their jobs are requeued by a health check after a VE failure is detected, have them set REQUEUE in --mail-type when submitting jobs.
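For example, a job can be submitted as follows (the job script name and the requested VE count are placeholders for illustration):
---
$ sbatch --gres=ve:10b:2 --mail-type=REQUEUE job.sh
---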
(1-2-6) Log output
The health check script outputs the following messages to /var/log/messages. The tag name of the messages is "ve_healthchk_for_job.sh".
 1) Message when a VE failure is detected
    The VE that failed and needs repair can be identified from this message.
    "[JobId=<jobid>,exit_status=<exit_status>,device=VE<ve_number>] Failed to check node health at <timing>."
    Description: <jobid> is the ID of the job that triggered the health check. <exit_status> is a detailed code representing the cause of the failure; the values are as follows:
     2  A temporary file for failure detection could not be created at the job start check.
     3  The VEOS status was not ONLINE.
     4  The temporary file for failure detection was not found at the job end check.
    <ve_number> is the VE number of the failed VE. <timing> is the timing of the health check; the value is prolog_slurmd or epilog_slurmd.
 2) Message indicating the action mode
    "[JobId=<jobid>,NodeName=<nodename>,ActionMode=<mode>]Take action due to VEs health check failure."
    Description: <jobid> is the ID of the job that triggered the health check. <nodename> is the name of the compute node where the VE failure occurred. <mode> is the value of the action mode. It can be 0 or 1.
 3) Message that a job that triggered a health check failed to be requeued
    "[JobId=<jobid>]Failed to requeue job due to VEs health check failure."
    Description: <jobid> is the ID of the job that triggered the health check.
 4) Message that the failed compute node failed to be updated to the DOWN state
    "[NodeName=<nodename> JobId=<jobid>]Failed to update the state of the compute node to DOWN due to VEs health check failure."
    Description: <nodename> is the name of the compute node where the VE failure occurred. <jobid> is the ID of the job that triggered the health check.
 5) Message that the health check was skipped because the command to check the VEOS status could not be executed
    "[JobId=<jobid>,Timing=<timing>]The command for VE health check does not exist or it does not have execute permission. Skip VEs health check."
    Description: <jobid> is the ID of the job that triggered the health check. <timing> is the timing of the health check; the value is prolog_slurmd or epilog_slurmd.
 6) Message when the health check timing is neither Prolog nor Epilog
    "The value of check timing is abnormal. (SLURM_SCRIPT_CONTEXT=<value>)"
    Description: <value> is the value of the environment variable SLURM_SCRIPT_CONTEXT passed from SLURM. The value will be something other than prolog_slurmd or epilog_slurmd.

(1-2-7) Notes
 1) In Action Mode 1, if nothing is done about the compute node when a failure is detected, the same job will repeatedly fail the health check on the same compute node. To resolve this condition, the system administrator should take one of the following actions (see the scontrol example after these notes):
    - Remove the settings of the failed VE from the configuration of the compute node.
    - Update the compute node to the DOWN state and remove it from operation.
 2) Make sure to set the health check script in both Prolog and Epilog in /etc/slurm/slurm.conf. If only one is set, failures may not be detected.
 3) In Action Mode 2, when a VE failure is detected at job start, the compute node is updated to the DOWN state, but the job that ran on that node may not execute correctly even if it is requeued. Therefore, delete the job with scancel and submit it again.
 4) The exit code of the health check script is always 0 even if a VE failure is detected. It is not possible to determine from the exit code whether a failure was detected.
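For the second action in note 1), the node can be taken out of operation manually with scontrol (the node name and reason text are examples):
---
# scontrol update NodeName=vhost1 State=DOWN Reason="VE failure detected by health check"
---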
(1-3) Periodic Health Checks
The VEs configured on compute nodes that install VEs are checked for failures periodically. If a failure is detected on a VE, the check stops there (the other configured VEs are not checked) and one of the following two action modes takes effect.

Action Mode 1 suits operations where you want to keep using the healthy VEs and CPU cores of a compute node even if some VEs fail. Action Mode 2 suits operations where the cause of a failure is investigated and repaired immediately when it occurs. Select the action mode according to your operational policy. The default is Action Mode 1.

 - Action Mode 1 (Compute Node Operation Continuation Mode)
   Operation continues without taking any special action on the compute node. Nothing is done to jobs running on the compute node.
 - Action Mode 2 (Compute Node Operation Stop Mode)
   The entire compute node is treated as a failed node, set to the DOWN state, and removed from operation. All jobs running on that node are requeued. If the option to notify the user when a job is requeued (--mail-type) was specified at job submission, a notification for the requeued job is sent according to the user notification settings.

Only VEs configured in /etc/slurm/gres.conf are checked. If the state of a compute node is already DOWN or DRAIN before the health check, the health check is not performed on that node.

(1-3-1) Health check settings
The VEs health check is performed by a shell script. The sample script is installed under /etc/slurm/ when the slurm-for_ve-22.05.2-1.el7.x86_64.rpm package is installed. The sample script name is ve_healthchk_for_node.sh.example. Follow the steps below with root privileges:
 1) Rename /etc/slurm/ve_healthchk_for_node.sh.example to /etc/slurm/ve_healthchk_for_node.sh.
 2) Change the file permissions to 755 with the chmod command.
    Example: chmod 755 ve_healthchk_for_node.sh
 3) Set the full path of the health check script in the HealthCheckProgram option in /etc/slurm/slurm.conf.
    Example: HealthCheckProgram=/etc/slurm/ve_healthchk_for_node.sh
 4) Set the health check interval in seconds in the HealthCheckInterval option in /etc/slurm/slurm.conf.
    Example: HealthCheckInterval=300
 5) Set the HealthCheckNodeState option in /etc/slurm/slurm.conf to the states of the compute nodes that need to be checked. Normally, set ANY so that no node is omitted from the check.
    Example: HealthCheckNodeState=ANY
 6) Open the health check script with an editor such as vi and set the action mode that matches your operational policy. The default is Action Mode 1.
    For Action Mode 1: TROUBLESHOOTING_MODE=1
    For Action Mode 2: TROUBLESHOOTING_MODE=2
 7) Run the "scontrol reconfigure" command to apply the settings.
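Steps 3) to 5) correspond to the following lines in /etc/slurm/slurm.conf (a minimal sketch using the example values above; choose the interval and node state to match your site policy):
---
HealthCheckProgram=/etc/slurm/ve_healthchk_for_node.sh
HealthCheckInterval=300
HealthCheckNodeState=ANY
---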
(1-3-2) Return value of health check shell script
If no VE failure is detected, the script exits with exit 0. If a VE failure is detected, the script exits with exit 1.

(1-3-3) Actions for compute nodes
 - Action Mode 1
   Operation continues without taking any action on the failed compute node. If left unattended, however, a faulty VE may be assigned to a job. Therefore, when a failure is detected, delete the VE from /etc/slurm/slurm.conf and /etc/slurm/gres.conf.
   Example: when 8 VEs are installed on compute node vhost1 and VE0 fails
   Reduce the number of VEs in /etc/slurm/slurm.conf (Gres=ve:10b:8 -> Gres=ve:10b:7):
    Before: NodeName=vhost1 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128828 Gres=ve:10b:8,hca:2 State=UNKNOWN
    After:  NodeName=vhost1 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128828 Gres=ve:10b:7,hca:2 State=UNKNOWN
   Remove the VE from /etc/slurm/gres.conf (File=/dev/veslot[0-7] -> File=/dev/veslot[1-7]):
    Before: NodeName=vhost1 Name=ve Type=10b File=/dev/veslot[0-7]
    After:  NodeName=vhost1 Name=ve Type=10b File=/dev/veslot[1-7]
 - Action Mode 2
   The failed compute node is changed to the DOWN state. You can check the node state with the "scontrol show node" command. The following message is displayed in the "Reason" column:
    "Periodic VEs health check failure"

(1-3-4) Actions for jobs
 - Action Mode 1
   Nothing is done to jobs running on the compute node. The system administrator should compare the VE numbers assigned to each job with the VE number where the failure occurred, and requeue jobs with the scontrol command if necessary.
 - Action Mode 2
   All jobs running on the compute node are requeued. The jobs are assigned to other healthy compute nodes.

(1-3-5) User notification
If you want users to be notified when their jobs are requeued by a health check after a VE failure is detected, have them set REQUEUE in --mail-type when submitting jobs.

(1-3-6) Log output
The health check script outputs the following messages to /var/log/messages. The tag name of the messages is "ve_healthchk_for_node.sh".
 1) Message when a VE failure is detected
    The VE that failed and needs repair can be identified from this message.
    "[device=VE<ve_number>] Failed to check node health periodically.(reason:<reason>)"
    Description: <ve_number> is the VE number of the failed VE. <reason> is the detailed reason for the failure; it is one of the following:
     "Not found veslot file"
     "Not found os_state file"
     "VEOS state is not ONLINE(state=<state>)"
 2) Message when no VE is installed and the health check is skipped
    "Not found VEs. Skip VEs health check."
 3) Message indicating the action mode
    "[NodeName=<nodename>,ActionMode=<mode>]Take action due to VEs health check failure."
    Description: <nodename> is the name of the compute node where the VE failure occurred. <mode> is the value of the action mode. It can be 0 or 1.
 4) Message that the state of the failed compute node could not be obtained before the state change
    "[NodeName=<nodename>]Failed to get the state of the compute node due to VEs health check failure, skip the node state change."
    Description: <nodename> is the name of the compute node where the VE failure occurred.
 5) Message that the failed compute node failed to be changed to the DOWN state
    "[NodeName=<nodename>]Failed to update the state of the compute node to DOWN due to VEs health check failure."
    Description: <nodename> is the name of the compute node where the VE failure occurred.
 6) Message that the compute node state is DOWN/DRAIN and the health check or node state change is skipped
    "[NodeName=<nodename>]The state of the compute node is <state>, skip VEs health check."
    Description: <nodename> is the name of the compute node where the VE failure occurred. <state> is the value of the State column displayed by the "scontrol show node" command.
 7) Message that the state of the compute node could not be obtained before the health check and the health check was skipped
    "[NodeName=<nodename>]Failed to get the state of the compute node, skip VEs health check."
    Description: <nodename> is the name of the compute node to be checked.

(1-3-7) Notes
 1) If the health check takes longer than 60 seconds for some reason, SLURM forcibly terminates the health check script, so a VE failure may go undetected.
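The core of the periodic check is reading each configured VE's VEOS state. The following is a minimal sketch of that test only, assuming the usual VEOS sysfs layout /sys/class/ve/ve<N>/os_state; the shipped ve_healthchk_for_node.sh is more thorough (veslot files, node state handling, action modes):
---
#!/bin/bash
# Minimal sketch: exit 1 at the first VE whose VEOS state is not ONLINE.
for f in /sys/class/ve/ve[0-9]*/os_state; do
    if [ ! -r "$f" ]; then
        echo "Not found os_state file" >&2
        exit 1
    fi
    state=$(cat "$f")
    if [ "$state" != "ONLINE" ]; then
        echo "VEOS state is not ONLINE(state=$state)" >&2
        exit 1
    fi
done
exit 0
---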
(2) VE Accounting

(2-1) Overview
When a user executes a VE job, the following VE accounting information can be aggregated for each job on the compute node after the job execution completes:
 - CPU consumption time on VE nodes
 - Maximum memory consumption on VE nodes
 - Total memory consumption on VE nodes
 - Average memory consumption on VE nodes
 - List of used VE nodes
Run the aggregation command on the compute node to aggregate the VE accounting information. The aggregated information is saved in a file. The aggregated VE accounting information can be displayed with the same command.
Before using this feature, output of the VE process accounting must be enabled. For detailed configuration instructions, refer to section 4.14 Configuration for Process accounting in the SX-Aurora TSUBASA Installation Guide.
https://sxauroratsubasa.sakura.ne.jp/documents/guide/pdfs/InstallationGuide_E.pdf
In addition, since the cgroup-related information of the job is used when aggregating VE accounting information, the cgroup function of SLURM must be enabled.

(2-2) Aggregation command (ve_acct)
This command aggregates or displays VE accounting information. The execution format of the command is as follows:
 ve_acct [-h|--help] [-d|--acct-dir <dir>] [-l|--log-file <file>]
         [-i|--jobids <jobid>[,<jobid>...]] [-r|--aggregate-run]
         [-c|--aggregate-complete]
The meaning of each option is as follows.
 -h, --help
   Show the help message and exit.
 -d, --acct-dir <dir>
   Specify the directory in which the VE accounting information files (JSON format) to be aggregated or displayed are saved (hereafter referred to as the "account save directory"). You can specify either an absolute path or a relative path. If the specified directory does not exist, it is created. If this option is not specified, the default account save directory is /var/spool/slurm_for_ve/.
 -l, --log-file <file>
   Specify the file in which the log produced when aggregating or displaying VE accounting information is saved (hereafter referred to as the "log file"). You can specify either an absolute path or a relative path. If the specified file does not exist, it is created; if it already exists, the log is appended to it. If this option is not specified, the default log file is /var/log/veacct.log.
 -i, --jobids <jobid>[,<jobid>...]
   Specify the job IDs of the VE accounting information to be displayed. You can specify multiple job IDs by separating them with commas (e.g., -i 0,1,2). If this option is not specified, the VE accounting information of all jobs is displayed. This option cannot be specified together with -r, --aggregate-run or -c, --aggregate-complete.
 -r, --aggregate-run
   Collects the job IDs of running jobs and the session IDs of their VE processes (hereafter referred to as "base information"). This option cannot be specified together with -i, --jobids.
 -c, --aggregate-complete
   Aggregates the VE accounting information of completed jobs. VE accounting information is aggregated when a job completes, based on the collected base information, so base information must have been collected with -r, --aggregate-run before running the aggregation command with this option. If this option is specified together with -r, --aggregate-run, both the collection of base information for running jobs and the aggregation of VE accounting information for completed jobs are performed. This option cannot be specified together with -i, --jobids.
If the aggregation command is executed without specifying -r, --aggregate-run or -c, --aggregate-complete, the aggregated VE accounting information is displayed.
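For example, a single invocation can collect base information and aggregate completed jobs in one pass, using a non-default save directory and log file (both paths are examples):
---
# ve_acct -r -c -d /data/veacct -l /var/log/veacct.log
---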
(2-2-1) Command placement
This command can be installed on compute nodes with the following packages, built from slurm-22.05.2.tar.bz2:
 RHEL7: slurm-for_ve-22.05.02-1.el7.x86_64.rpm
 RHEL8: slurm-for_ve-22.05.02-1.el8.x86_64.rpm
The command is installed under /usr/bin/. Together with this command, the dump-veacct command for referencing the VE process accounting file is installed in the same directory.
Since the cgroup-related information of the job is used to aggregate VE accounting information, enable one of the following settings in /etc/slurm/slurm.conf:
 ProctrackType=proctrack/cgroup
 TaskPlugin=task/cgroup
 JobacctGatherType=jobacct_gather/cgroup
Python 3.6 or higher is required to execute this command.

(2-2-2) Aggregation settings
This command performs aggregation on the assumption that eight VEs are installed in each compute node. For other configurations, open /usr/bin/ve_acct and change the following global variable at the beginning of the file to the appropriate value.
 Variable name : VE_NODE_NUMBER_PER_VH
 Change example: VE_NODE_NUMBER_PER_VH = 16
Since this command is not a daemon process, each execution performs a single collection, aggregation, or display pass. By using the cron function of Linux, VE accounting information can be aggregated periodically. Whether or not you use cron, the command must be run with root privileges to collect or aggregate. The following are examples of cron configurations.
 - Collect base information of running jobs at 1-minute intervals:
   ---
   # crontab -e
   * * * * * /usr/bin/ve_acct -r
   ---
 - Aggregate VE accounting information for completed jobs at 5-minute intervals:
   ---
   # crontab -e
   */5 * * * * /usr/bin/ve_acct -c
   ---
 - Collect base information of running jobs and aggregate VE accounting information of completed jobs together at 3-minute intervals:
   ---
   # crontab -e
   */3 * * * * /usr/bin/ve_acct -r -c
   ---
To ensure that all base information for running jobs is collected, the collection interval should be shorter than the elapsed time of the shortest job.
This command refers to the VE process accounting files (/var/opt/nec/ve/account/pacct_<ve_number>) to aggregate the VE accounting information for a job. If a VE process accounting file is moved by rotation settings etc. while it still contains unaggregated VE accounts, aggregation becomes impossible. Configure /etc/logrotate.d/psacct-ve, the rotation setting of the VE process accounting files on the VEOS, accordingly.

(2-3) Aggregation of VE accounts
 - Summary items
   The VE resources aggregated by this feature and their units are as follows:
   ----------------------------------------------------------
   Item                                     Unit
   ----------------------------------------------------------
   CPU consumption time on VE nodes         tick (1 tick = 10 ms)
   Maximum memory consumption on VE nodes   Kbytes
   Total memory consumption on VE nodes     Kbytes*tick
   Average memory consumption on VE nodes   Kbytes
   List of used VE nodes                    -
   ----------------------------------------------------------
 - Calculation method
   CPU consumption time on VE nodes and total memory consumption on VE nodes are the sums of the values of each process of the job, retrieved from the VE process accounting files.
   Maximum memory consumption on VE nodes is the maximum among the total VE maximum memory usage of the processes executed on each VE node used by the job.
   Average memory consumption on VE nodes is calculated by dividing total memory consumption on VE nodes by CPU consumption time on VE nodes. If CPU consumption time on VE nodes is 0, average memory consumption on VE nodes is the same value as total memory consumption on VE nodes.
   List of used VE nodes is a list of the VE numbers used by the job. If a VE node is requested but not used, its VE number is not counted.
 - VE account file
   The file that saves the base information of a running job and the VE accounting information of a completed job is called the VE account file. The details of the file are as follows:
   Format: JSON
   Naming:
    File that saves only base information:          <jobid>..json
    File with aggregated VE accounting information: <jobid>.json
    * <jobid> is the value of the actual job ID.
   Save location:
    -d, --acct-dir specified:     <specified directory>/<VE account file>
    -d, --acct-dir not specified: /var/spool/slurm_for_ve/<VE account file>
   Contents: If only the base information of a running job has been collected, the file contains the job ID and the session ID list of the VE processes executed in the job. Once the VE accounting information has been aggregated, the VE accounting information is added to the base information.
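As a concrete cross-check of the calculation method, the values displayed for job 61682 in the display example of (2-4) below are mutually consistent. Undoing the display-time unit conversions described in (2-4) recovers the raw aggregated values, and their quotient reproduces the displayed average:
---
raw CPU time     = VECpuTime * 100     = 42.77 * 100        = 4277 ticks
raw total memory = VEKcoreMin * 6000   = 15476004.21 * 6000 = 92856025260 Kbytes*tick
average memory   = raw total / raw CPU = 92856025260 / 4277 = 21710550.68 Kbytes (= VEMeanMemory)
---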
(2-4) Display VE accounts
If the aggregation command is executed without -r, --aggregate-run and -c, --aggregate-complete, the VE accounting information of all jobs is displayed. If the -i, --jobids option is specified, the VE accounting information of the specified jobs is displayed. If -d, --acct-dir is specified, the VE accounts under the specified account save directory are displayed; otherwise, the VE accounts under the default account save directory /var/spool/slurm_for_ve/ are displayed. A display example is shown below.
---
$ ve_acct
JobID  VECpuTime(s)  VEMaxMemory(K)  VEMeanMemory(K)  VEKcoreMin(KMin)  VENodeList
-----  ------------  --------------  ---------------  ----------------  ---------------
61682  42.77         21801984        21710550.68      15476004.21       0,1
61683  42.51         21848064        21763542.39      15419469.79       0,1,2,3
61684  46.51         21858060        21763502.31      15419469.79       0,1,2,3,4,5,6,7
---
The columns are described below.
-------------------------------------------------------------------
JobID             Job ID
VECpuTime(s)      CPU consumption time on VE nodes in seconds. It is calculated by dividing the aggregated value by 100.
VEMaxMemory(K)    Maximum memory consumption on VE nodes in Kbytes.
VEMeanMemory(K)   Average memory consumption on VE nodes in Kbytes.
VEKcoreMin(KMin)  Total memory consumption on VE nodes in Kbytes*min. It is calculated by dividing the aggregated value by 6000 (100*60).
VENodeList        List of used VE nodes.
-------------------------------------------------------------------

(2-5) Notes
 1) VE accounts cannot be aggregated for jobs that had already ended before the aggregation command was executed.
 2) If the VE process accounting file is set to be rotated, the VE accounts may not be aggregated correctly depending on the job end timing and the rotation timing. Configure the rotation of the VE process accounting file appropriately so that it is rotated only while no jobs are running.
 3) If you change the account save directory during operation, the VE accounts may not be aggregated correctly. Change the account save directory only after confirming that all jobs have completed and their VE accounts have been output.
 4) The VE account file "<jobid>..json" is an unfinished file, so exclude it when performing generation management of VE account files.
 5) VE account files are named by job ID. An existing VE account may be overwritten if the job ID wraps around. In addition, the total size of the account save directory grows as accounts are aggregated, which may affect account display and aggregation. Therefore, manage the generations of the files under the account save directory appropriately (see the cleanup sketch after these notes).
 6) If renaming the VE account file fails after VE account aggregation, leave the file as it is and the aggregation for that job will be repeated. Check for the following log entry in the log file and rename the file of the corresponding job to <jobid>.json:
    ---
    error:write_aggregated_jobacct: Failed to rename account file.(jobid:<jobid> file:<filename>)
    ---
 7) Jobs that failed to run and have no start or end time are not aggregated in the VE accounts.
 8) If the log file is not accessible, the log is output to /var/log/messages when the VE accounts are displayed or aggregated.
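One simple way to implement the generation management mentioned in notes 4) and 5) is a periodic cleanup job. The sketch below is only an example: the retention period and directory are placeholders, and the extra -name test skips the unfinished "<jobid>..json" files from note 4):
---
# crontab -e
# Daily at 03:00: remove aggregated VE account files older than 30 days.
0 3 * * * find /var/spool/slurm_for_ve -name '*.json' ! -name '*..json' -mtime +30 -delete
---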