Uni.lu HPC School 2019 - UL HPC Tutorials


Uni.lu HPC School 2019
PS3: [Advanced] Job scheduling (SLURM)
Uni.lu High Performance Computing (HPC) Team
C. Parisot
University of Luxembourg (UL), Luxembourg
http://hpc.uni.lu

Latest versions available on GitHub:
- UL HPC tutorials: https://github.com/ULHPC/tutorials
- UL HPC School / PS3 tutorial: https://ulhpc-tutorials.rtfd.io/en/latest/scheduling/advanced/

Summary
1 Introduction
2 SLURM workload manager
   - SLURM concepts and design for iris
   - Running jobs with SLURM
3 OAR and SLURM
4 Conclusion

Main Objectives of this Session

Design and usage of SLURM
- cluster workload manager of the UL HPC iris cluster
- ... and future HPC systems

The tutorial will show you:
- the way SLURM was configured, accounting and permissions
- common and advanced SLURM tools and commands: srun, sbatch, squeue, etc.
- job specification
- SLURM job types
- comparison of SLURM (iris) and OAR (gaia & chaos)
- SLURM generic launchers you can use for your own jobs

Documentation & comparison to OAR: https://hpc.uni.lu/users/docs/scheduler.html

SLURM workload manager

SLURM - core concepts

SLURM manages user jobs with the following key characteristics:
- a set of requested resources:
  - number of computing resources: nodes (including all their CPUs and cores) or CPUs (including all their cores) or cores
  - number of accelerators (GPUs)
  - amount of memory: either per node or per (logical) CPU
  - the (wall)time needed for the user's tasks to complete their work
- a set of constraints limiting jobs to nodes with specific features
- a requested node partition (job queue)
- a requested quality of service (QoS) level which grants users specific accesses
- a requested account for accounting purposes

Example: run an interactive job (alias: si)
(access)$ srun -p interactive --qos qos-interactive --pty bash -i
(node)$ echo $SLURM_JOBID
2058

Simple interactive job running under SLURM
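As a sketch of how the requested resources above translate into command-line options, the following interactive request asks for a node, tasks, cores per task, memory per core and a walltime (all values are illustrative, not site defaults):

(access)$ srun -p interactive -N 1 --ntasks-per-node 2 -c 2 --mem-per-cpu 4GB --time 0-00:30:00 --pty bash -i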

SLURM - job example (I)

$ scontrol show job 2058
JobId=2058 JobName=bash
   UserId=vplugaru(5143) GroupId=clusterusers(666) MCS_label=N/A
   Priority=100 Nice=0 Account=ulhpc QOS=qos-interactive
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:08 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2017-06-09T16:49:42 EligibleTime=2017-06-09T16:49:42
   StartTime=2017-06-09T16:49:42 EndTime=2017-06-09T16:54:42 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=interactive AllocNode:Sid=access2:163067
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=iris-081
   BatchHost=iris-081
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1
   TRES=cpu=1,mem=4G,node=1
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/mnt/irisgpfs/users/vplugaru
   Power=

Simple interactive job running under SLURM
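To check just a few fields of interest instead of the full record, the output can be filtered; the job ID and the grep pattern below are illustrative:

(access)$ scontrol show job 2058 | grep -E 'JobState|RunTime|NodeList'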

SLURM - job example (II)

Many metrics are available during and after job execution
- including energy (J) – but with caveats
- job steps counted individually
- enabling advanced application debugging and optimization

Job information is available in an easily parseable format (add -p/-P)

$ sacct -j 2058 --format=account,user,jobid,jobname,partition,state
   Account      User  JobID  JobName  Partition      State
     ulhpc  vplugaru   2058     bash  interacti  COMPLETED
$ sacct -j 2058 --format=elapsed,elapsedraw,start,end
 Elapsed  ElapsedRaw                Start                  End
00:02:56         176  2017-06-09T16:49:42  2017-06-09T16:52:38
$ sacct -j 2058 --format=maxrss,maxvmsize,consumedenergy,consumedenergyraw,nnodes,ncpus,nodelist
MaxRSS  MaxVMSize  ConsumedEnergy  ConsumedEnergyRaw  NNodes  NCPUS  NodeList
     0    299660K          17.89K       17885.000000       1      1  iris-081

Job metrics after execution ended
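For a job that is still running, similar metrics can be pulled live with sstat (job ID illustrative; --allsteps reports every job step):

(access)$ sstat --allsteps -j 2058 --format=AveCPU,AveRSS,MaxRSS,MaxVMSize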

SLURM - design for iris (I)

[Table: iris partitions with, for each, the number of nodes, the default and maximum walltime, and the maximum nodes/cores/jobs allowed, together with the matching QOS (e.g. qos-interactive, qos-long).]

SLURM - design for iris (II)

You have some private QoS not accessible to all users.

[Table: QOS list with, for each QOS, the user group it is open to (ALL or a private group) and its limits on maximum cores, jobs, etc.]

SLURM - design for iris (III)

Default partition: batch, meant to receive most user jobs
- we hope to see the majority of user jobs being able to scale
- shorter walltime jobs are highly encouraged

All partitions have a correspondingly named QOS
- granting resource access (long : qos-long)
- any job is tied to one QOS (user specified or inferred)
- automation in place to select the QOS based on the partition
- jobs may wait in the queue with a QOS*Limit reason set
  - e.g. QOSGrpCpuLimit if the group limit for CPUs was reached

Preemptible besteffort QOS available for the batch and interactive partitions (but not yet for bigmem, gpu or long)
- meant to ensure maximum resource utilization, especially on batch
- should be used together with restartable software (see the sketch below)

QOSs specific to particular group accounts exist (discussed later)
- granting additional accesses to platform contributors
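A minimal sketch of submitting a best-effort job under these rules; the launcher script name is a placeholder, and --requeue is added because requeueing is disabled by default:

(access)$ sbatch -p batch --qos qos-besteffort --requeue ./launcher.sh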

SLURM - design for iris (IV)

Note: to optimize your job, help will be needed on your parameters!

Backfill scheduling for efficiency
- multifactor job priority (size, age, fair share, QOS, ...)
- currently weights set for: job age, partition and fair share
- other factors/decay to be tuned as needed, with more user jobs waiting in the queues

Resource selection: consumable resources
- cores and memory as consumable (per-core scheduling)
- GPUs as consumable (4 GPUs per node in the gpu partition)
- block distribution for cores (best-fit algorithm)
- default memory/core: 4GB (4.1GB maximum, the rest is for the OS)
  - gpu and bigmem partitions: 27GB maximum

User process tracking with cgroups
- cpusets used to constrain cores and RAM (no swap allowed)
- task affinity used to bind tasks to cores (hwloc based)

Hierarchical tree topology defined (for the network)
- for optimized job resource allocation
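An illustrative request that exercises these consumable resources (GPU count, task/core layout and memory value are examples only; gpu_launcher.sh is a placeholder script):

(access)$ sbatch -p gpu -N 1 --gres gpu:2 --ntasks-per-node 2 -c 7 --mem-per-cpu 8GB ./gpu_launcher.sh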

A note on job priority

Job_priority =
    (PriorityWeightAge)       * (age_factor)        +
    (PriorityWeightFairshare) * (fair-share_factor) +
    (PriorityWeightJobSize)   * (job_size_factor)   +
    (PriorityWeightPartition) * (partition_factor)  +
    (PriorityWeightQOS)       * (QOS_factor)        +
    SUM(TRES_weight_cpu    * TRES_factor_cpu,
        TRES_weight_<type> * TRES_factor_<type>, ...)

For complete details see: https://slurm.schedmd.com/priority_multifactor.html

TRES - Trackable RESources
- CPU, Energy, Memory and Node tracked by default
GRES - Generic RESources
- GPU

Corresponding weights/reset periods tuned with your feedback
- we require (your & your group's) usage pattern to optimize them
- the target is high interactivity (low time spent by the jobs waiting)
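To see how these factors add up for your own pending jobs, sprio can print the per-factor breakdown (long format, filtered to your user):

(access)$ sprio -l -u $USER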

SLURM - design for iris (V)

Some details on job permissions.
- Partition limits + association-based rule enforcement
  - association settings in SLURM's accounting database
- QOS limits imposed, e.g. you will see (QOSGrpCpuLimit)
- Only users with existing associations are able to run jobs

Best-effort jobs possible through a preemptible QOS: qos-besteffort
- of lower priority and preemptible by all other QOS
- preemption mode is requeue, requeueing disabled by default

On metrics: accounting & profiling data for jobs is sampled every 30s
- tracked: cpu, mem, energy
- energy data retrieved through the RAPL mechanism
  - caveat: for energy, not all hardware that may consume power is monitored with RAPL (CPUs, GPUs and DRAM are included)
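To check which associations (account, partition, QOS) you can actually run under, the accounting database can be queried directly; the format fields below are just a useful selection:

(access)$ sacctmgr show associations where user=$USER format=account,user,partition,qos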

SLURM - design for iris (VI)

On tightly coupled parallel jobs (MPI)
- Process Management Interface (PMI 2) highly recommended
- PMI2 used for better scalability and performance
  - faster application launches
  - tight integration with SLURM's job steps mechanism (& metrics)
  - we are also testing PMIx (PMI Exascale) support
- PMI2 enabled in the default software set for IntelMPI and OpenMPI
  - requires minimal adaptation in your workflows
  - (at minimum:) replace mpirun with SLURM's srun
  - if you compile/install your own MPI you'll need to configure it
- Many examples at: https://hpc.uni.lu/users/docs/slurm_launchers.html

SSH-based connections between computing nodes are still possible
- other MPI implementations can still use ssh as launcher
  - but really shouldn't need to, PMI2 support is everywhere
- user jobs are tracked: no job, no access to the node
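A minimal MPI launcher sketch along these lines, where mpirun is replaced by srun; the job name, resource values, module name and program path are all illustrative and should be adapted to the software set you load:

#!/bin/bash -l
#SBATCH -J mpi_test
#SBATCH -N 2
#SBATCH --ntasks-per-node 28
#SBATCH --time 0-00:30:00
#SBATCH -p batch

module load toolchain/intel          # illustrative module name
srun -n $SLURM_NTASKS ./my_mpi_app   # srun launches the MPI ranks (PMI2), replacing mpirun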

SLURM - design for iris (VII)

ULHPC customization through plugins

Job submission rule / filter
- for now: QOS initialization (if needed)
- more rules to come (group credits, node checks, etc.)

Per-job temporary directories creation & cleanup
- better security and privacy, using kernel namespaces and binding
- /tmp & /var/tmp are /tmp/<jobid>.<restart_count>/[tmp,var_tmp]
- transparent for applications run through srun
- applications run with ssh cannot be attached, and will see the base /tmp

X11 forwarding (GUI applications)
- an issue prevents us from using the `--x11` option of SLURM on iris
  - workaround in the tutorial/FAQ
  - create a job using salloc and then use ssh -Y
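A sketch of the salloc + ssh -Y workaround mentioned above; the resource values are illustrative and iris-081 stands for whichever node your allocation lands on:

(access)$ salloc -p interactive -N 1 -n 1 --time 0-01:00:00
(access)$ squeue -u $USER     # note the node allocated to the job, e.g. iris-081
(access)$ ssh -Y iris-081     # then start the GUI application from this session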

SLURM - design for iris (VIII)

Software licenses in SLURM

ARM (ex. Allinea) Forge and Performance Reports for now
- static allocation in the SLURM configuration
- dynamic checks for FlexNet / RLM based applications coming later

Number and utilization state can be checked with:
- scontrol show licenses

Use is not enforced, the honor system applies
- srun [...] -L licname:licnumber

$ srun -N 1 -n 28 -p interactive -L forge:28 --pty bash -i

SLURM - bank (group) accounts

Hierarchical bank (group) accounts
- UL as the root account, then underneath it accounts for the 3 Faculties and 3 ICs
- All Professors, group leaders and above have bank accounts, linked to a Faculty or IC
  - with their own name: Name.Surname
- All user accounts are linked to a bank account
  - including Professors' own user
- The iris accounting DB contains:
  - over 103 group accounts from the Faculties, ICs & Externals
  - comprising 877 users

Allows better usage tracking and reporting than was possible before.
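To inspect this hierarchy and your own share of it, sshare can be used; Name.Surname stands for a bank account following the naming scheme above:

(access)$ sshare -U                    # your own usage and fair-share factor
(access)$ sshare -a -A Name.Surname    # all users under a given bank account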

SLURM - brief commands overview

squeue: view queued jobs
sinfo: view partition and node information
sbatch: submit a job for batch (scripted) execution
srun: submit an interactive job, run (parallel) job steps
scancel: cancel queued jobs
scontrol: detailed control and information on jobs, queues, partitions
sstat: view system-level utilization (memory, I/O, energy)
- for running jobs / job steps
sacct: view system-level utilization
- for completed jobs / job steps (accounting DB)
sacctmgr: view and manage SLURM accounting data
sprio: view job priority factors
sshare: view accounting share information (usage, fair-share, etc.)
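A few quick invocations of the commands above (the job ID passed to scancel is illustrative):

(access)$ squeue -u $USER     # your own queued and running jobs
(access)$ sinfo -s            # summary of partitions and node states
(access)$ scancel 2058        # cancel a specific job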

SLURM - basic commands

Action                               SLURM command
Submit passive/batch job             sbatch <script>
Start interactive job                srun --pty bash -i
Queue status                         squeue
User (own) jobs status               squeue -u <user>
Specific job status (detailed)       scontrol show job <jobid>
Job metrics (detailed)               sstat --job <jobid> -l
Job accounting status (detailed)     sacct --job <jobid> -l
Job efficiency report                seff <jobid>
Delete (running/waiting) job         scancel <jobid>
Hold job                             scontrol hold <jobid>
Resume held job                      scontrol release <jobid>
Node list and their properties       scontrol show nodes
Partition list, status and limits    sinfo
Attach to running job                sjoin <jobid> [<node>]

QOS is deduced if not specified; the partition needs to be set if not "batch".

SLURM - basic options for sbatch/srun

Action                                        sbatch/srun option
Request n distributed nodes                   -N n
Request m memory per node                     --mem mGB
Request mc memory per core (logical cpu)      --mem-per-cpu mcGB
Request job walltime                          --time d-hh:mm:ss
Request tn tasks per node                     --ntasks-per-node tn
Request ct cores per task (multithreading)    -c ct
Request nt total # of tasks                   -n nt
Request g # of GPUs per node                  --gres gpu:g
Request to start job at specific time         --begin time
Specify job name as name                      -J name
Specify required node feature                 -C feature
Specify job partition                         -p partition
Specify QOS                                   --qos qos
Specify account                               -A account
Specify email address                         --mail-user address
Request email on event                        --mail-type event
Use the above actions in a batch script       prefix the option with #SBATCH
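Putting several of these options together in a single launcher; every value, the email address and the program name are illustrative placeholders:

#!/bin/bash -l
#SBATCH -J example_job
#SBATCH -N 2
#SBATCH --ntasks-per-node 28
#SBATCH --mem-per-cpu 4GB
#SBATCH --time 0-02:00:00
#SBATCH -p batch
#SBATCH --mail-type END,FAIL
#SBATCH --mail-user first.last@uni.lu   # illustrative address

echo "Running on: $SLURM_NODELIST"
srun ./my_program                       # placeholder executable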

