Slurm Job Scheduling Primer Terra - Texas A&M University


Terra: Slurm Job Scheduling Primer
Spring 2020
Texas A&M University, High Performance Research Computing
hprc.tamu.edu

HPRC's Newest Cluster
Grace is a 925-node Intel cluster from Dell with an InfiniBand HDR-100 interconnect and A100, RTX 6000, and T4 GPUs. The 925 nodes are based on the Intel Cascade Lake processor.
- 800 general compute nodes with 384 GB memory
- 100 A100 GPU nodes with 384 GB memory
- 9 RTX 6000 GPU nodes with 384 GB memory
- 8 T4 GPU nodes with 384 GB memory
- 8 large-memory (3 TB) nodes with 80 cores/node; all other nodes have 48 cores/node
- 5 login nodes
Grace status: testing and early user onboarding; available late Spring 2021. For more, see hprc.tamu.edu.

HPC Diagram: login nodes (1-3) connect through the job scheduler (Terra: Slurm) to compute nodes (1-6, and hundreds more), all of which share a common file storage system.

File Systems and User Directories

Directory            Env. Variable  Space Limit  File Limit  Intended Use
/home/$USER          $HOME          10 GB        10,000      Small to modest amounts of processing.
/scratch/user/$USER  $SCRATCH       1 TB         250,000     Temporary storage of large files for ongoing computations; not intended as a long-term storage area.

View usage and quota limits using the command:
Quota and file limit increases will only be considered for scratch and tiered directories. Request a group directory for sharing files.

Pop Quiz
Which one of the following is not an HPRC cluster?
A. Ada   B. Bozo   C. Grace   D. Terra

Batch Computing on HPRC Clusters
A batch job script is a text file that contains Unix and software commands. You submit the job script to the batch manager, which places it in a queue and dispatches it to the cluster compute nodes; the job writes its output files when it runs.
Batch managers: LSF on Ada; Slurm on Terra.
Access is via SSH on campus, or via VPN and SSH over the Internet from off campus.

Sample Job Script Structure (Terra)
- The #SBATCH parameters at the top describe your job to the job scheduler.
- A line starting with a single # is a comment and is not run as part of the script.
- Load the required module(s) first.
- Then list the command(s) to be executed by the job.
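As a concrete illustration, a minimal Terra job file might look like the sketch below; the job name, module, and program are hypothetical placeholders, and the #SBATCH values are examples rather than recommendations.

```shell
#!/bin/bash
## NECESSARY JOB SPECIFICATIONS (these describe the job to the scheduler)
#SBATCH --job-name=MyExample     # hypothetical job name
#SBATCH --time=01:30:00          # wall time limit; seconds must be specified on Terra
#SBATCH --ntasks=1               # total number of tasks (cores) for the job
#SBATCH --mem=2560M              # memory per node in MB (G for GB also works on Terra)
#SBATCH --output=example.%j      # stdout file; %j expands to the job ID

# Load the required module(s) first (module name is an assumption)
module load GCC/9.3.0

# Then the command(s) executed by the job
./my_program
```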

Important Batch Job Parameters (Terra)

Terra directive                Comment
#!/bin/bash                    Initializes the job environment.
#SBATCH --time=hh:mm:ss        Specifies the time limit for the job. Seconds (SS) must be specified on Terra.
#SBATCH --ntasks=N             Total number of tasks (cores) for the job.
#SBATCH --ntasks-per-node=N    Specifies the maximum number of tasks (cores) to allocate per node.
#SBATCH --mem=xxxxM            Sets the maximum amount of memory (MB) per node. G for GB is supported on Terra.

hprc.tamu.edu/wiki/HPRC:Batch_Translation

Mapping Jobs to Cores per Node on Terra
A. 28 cores on 1 compute node (preferred mapping, if applicable):
   #SBATCH --ntasks 28
   #SBATCH --tasks-per-node 28
B. 28 cores on 2 compute nodes:
   #SBATCH --ntasks 28
   #SBATCH --tasks-per-node 14
C. 28 cores on 4 compute nodes:
   #SBATCH --ntasks 28
   #SBATCH --tasks-per-node 7
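The three mappings follow from dividing the total task count evenly by the node count; a quick shell sanity check (the 28-task figure comes from the slide; even divisibility is assumed):

```shell
#!/bin/sh
# Given a total task count, show the --tasks-per-node value
# for an even split across 1, 2, and 4 nodes (cases A, B, C above).
ntasks=28
for nodes in 1 2 4; do
  echo "$ntasks tasks on $nodes node(s): --tasks-per-node $((ntasks / nodes))"
done
```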

Job Memory Requests on Terra
Specify the memory request based on memory per node:
#SBATCH --mem xxxxM    # memory per node in MB
or
#SBATCH --mem xG       # memory per node in GB
- On 64 GB nodes, usable memory is at most 56 GB; the per-process memory limit should not exceed 2000 MB for a 28-core job.
- On 128 GB nodes, usable memory is at most 112 GB; the per-process memory limit should not exceed 4000 MB for a 28-core job.
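The per-process limits follow from dividing the usable node memory evenly across a 28-core job; the slide's 2000 MB and 4000 MB figures round the exact quotients down slightly:

```shell
#!/bin/sh
# Usable node memory (GB) divided evenly across a 28-core job.
cores=28
for usable_gb in 56 112; do
  per_core_mb=$(( usable_gb * 1024 / cores ))
  echo "${usable_gb} GB usable / ${cores} cores = ${per_core_mb} MB per core"
done
# prints 2048 and 4096 MB per core, respectively
```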

Consumable Computing Resources
Resources specified in a job file:
- Processor cores
- Memory
- Wall time
- GPUs
Other resources:
- Software licenses/tokens: query with "license_status" (e.g., find available licenses for "ansys"); see hprc.tamu.edu/wiki/SW:License_Checker for detailed options.
- Service Units (SUs), charged to a billing account: query with "myproject"; see hprc.tamu.edu/wiki/HPRC:AMS:Service_Unit.

Terra: Examples of SUs Charged Based on Job Cores, Time and Memory Requested
A Service Unit (SU) on Terra is equivalent to one core, or 2 GB of memory usage, for one hour. The example table lists: Number of Cores, GB of memory per core, Total Memory (GB), Hours, and SUs charged.
hprc.tamu.edu/wiki/HPRC:AMS:Service_Unit
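Based on the equivalence above (1 SU = 1 core-hour or 2 GB-hours of memory), the charge can be sketched as the larger of the core and memory requests times the wall time. Note that this max() rule is an assumption consistent with the stated equivalence, not a quote of the billing policy:

```shell
#!/bin/sh
# Sketch: SUs = max(cores, total_memory_GB / 2) * hours,
# assuming the larger of the core and memory requests drives the charge.
compute_sus() {  # args: cores total_mem_gb hours
  cores=$1; mem_gb=$2; hours=$3
  mem_cores=$(( mem_gb / 2 ))          # 2 GB of memory counts like 1 core
  [ "$mem_cores" -gt "$cores" ] && cores=$mem_cores
  echo $(( cores * hours ))
}
compute_sus 28 56 1    # 28 cores, 56 GB, 1 hour  -> 28 SUs
compute_sus 1 112 2    # 1 core but 112 GB, 2 hrs -> charged like 56 cores -> 112 SUs
```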

Batch Queues
- Job submissions are auto-assigned to batch queues based on the resources requested (number of cores/nodes and walltime limit).
- Some jobs can be directly submitted to a queue. On Terra, if GPU nodes are needed, use the gpu partition/queue:
  #SBATCH --partition gpu
- Jobs that have special resource requirements are scheduled in the special queue (access must be requested to use this queue).
hprc.tamu.edu/wiki/Terra:Batch#Queues

sinfo: Current Queues on Terra
For the NODES and CPUS columns: A = Active (in use by running jobs), I = Idle (available for jobs), O = Offline (unavailable for jobs), T = Total.

Queue Limits on Terra
- short:  448 cores / 16 nodes max per job, 2 hrs max walltime; 64 GB nodes (256) and 128 GB nodes with GPUs (36)
- medium: 1792 cores / 64 nodes, 1 day; same node types as short
- long:   896 cores / 32 nodes, 7 days; same node types as short
- xlong:  448 cores / 16 nodes, 21 days; 64 GB nodes (256); submit with --partition xlong
- gpu:    1344 cores / 48 nodes, 2 days; 128 GB nodes with GPUs (48); for jobs requiring GPUs
- vnc:    28 cores / 1 node, 12 hours; 128 GB nodes with GPUs (48); for remote visualization jobs
- knl:    68 cores / 8 nodes or 72 cores / 8 nodes, 7 days; 96 GB nodes with KNL processors (16); for jobs requiring a KNL processor
Per-user limits across queues: 1800 cores per user; 448 cores per user on xlong.
Batch queue policies also at: hprc.tamu.edu/wiki/Terra:Batch#Queues

Submitting Your Job and Checking Job Status
Submit the job file with sbatch; check its status with squeue.

Terra Job File (multi core, single node): example job script; 91 SUs charged.

Terra Job File (multi core, multi node): example job script; 288 SUs charged.

Terra Job File (serial GPU): example job script; 28 SUs charged.

Other Types of Jobs
- MPI and OpenMP
- Visualization: portal.hprc.tamu.edu (visualization jobs can be run on both Ada and Terra; more details in a later slide)
- Large numbers of concurrent single-core jobs: check out tamulauncher (hprc.tamu.edu/wiki/SW:tamulauncher)
  - Useful for running many single-core commands concurrently across multiple nodes within a job
  - Can be used with serial or multi-threaded programs
  - Distributes a set of commands from an input file to run on the cores assigned to a job
  - Can only be used in batch jobs
  - If a tamulauncher job gets killed, you can resubmit the same job to complete the unfinished commands in the input file
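A tamulauncher run might be sketched as below; the commands file name, task counts, and program are hypothetical, and the single-argument invocation is an assumption (see the wiki page above for the authoritative usage).

```shell
#!/bin/bash
#SBATCH --job-name=many_tasks    # hypothetical job name
#SBATCH --time=04:00:00
#SBATCH --ntasks=56              # two 28-core nodes' worth of workers
#SBATCH --tasks-per-node=28

# commands.in holds one single-core command per line, e.g.:
#   ./analyze sample001.dat
#   ./analyze sample002.dat
tamulauncher commands.in         # distributes the commands across the assigned cores
```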

Job Submission and Tracking (Terra)

Command                              Description
sbatch jobfile1                      Submit jobfile1 to the batch system
squeue [-u user_name] [-j job_id]    List jobs
scancel job_id                       Kill a job
sacct -X -j job_id                   Show information for a job (running or recently finished)
sacct -X -S YYYY-MM-DD               Show information for all of your jobs since YYYY-MM-DD
lnu job_id                           Show resource usage for a job
pestat -u USER                       Show resource usage for a running job
seff job_id                          Check CPU/memory efficiency for a job

hprc.tamu.edu/wiki/HPRC:Batch_Translation

Check Your Service Unit (SU) Balance
- List the SU balance of your account(s) with "myproject".
- A project ID to charge can be specified in the job file on Terra.
- Run "myproject -d Account#" to change your default project account.
- Run "myproject -h" to see more options.
hprc.tamu.edu/wiki/HPRC:AMS:Service_Unit
hprc.tamu.edu/wiki/HPRC:AMS:UI
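In the job file itself, Slurm's standard directive for charging a specific account is --account; a sketch (the account number is a placeholder to be replaced with one listed by "myproject"):

```shell
# In the Terra job file: charge this job to a specific project account
#SBATCH --account=XXXXXXXX    # placeholder; use an account number from "myproject"
```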

Job Submission Issue: Insufficient SUs
What to do on Terra if you need more SUs:
- Ask your PI to transfer SUs to your account
- Apply for more SUs (if you are eligible, as a PI or permanent researcher)
See "Q: How do I get more SUs?" at:
hprc.tamu.edu/wiki/HPRC:AMS:Service_Unit
hprc.tamu.edu/wiki/HPRC:AMS:UI

List Node Utilization on Terra: lnu
lnu jobid    # lists the node utilization across all nodes for a running job
Notes:
- Slurm updates the node information every few minutes.
- CPU LOAD is not the same as % utilization.
For the CPUS columns: A = Active (in use by running jobs), I = Idle (available for jobs), O = Offline (unavailable for jobs), T = Total.

Monitor Compute Node Utilization on Terra: pestat
pestat [-u username]    # lists the node utilization across all nodes for a running job
In the example output, low CPU load utilization is highlighted in red (Freemem should also be noted), good CPU load utilization in purple, and ideal CPU load utilization is displayed in white.

Job Environment Variables (Terra)
- $SLURM_JOBID: job id
- $SLURM_SUBMIT_DIR: directory the job was submitted from
- $SCRATCH: /scratch/user/NetID
- $TMPDIR: /work/job.$SLURM_JOBID (local to each assigned compute node; about 850 GB)
hprc.tamu.edu/wiki/Ada:Batch_Processing_LSF#Environment_Variables

Jobs Using tamubatch
tamubatch automatically builds and submits a batch job script for the user, without the need to write a full batch script on the cluster. Access help with:

portal.hprc.tamu.edu
The HPRC portal allows users to:
- Browse files on the filesystem
- Access the Ada, Terra, and Curie Unix command line
- Launch jobs
- Compose job scripts
- Launch interactive GUI apps (SUs charged)
https://hprc.tamu.edu/wiki/SW:Portal

Continued Learning
- Intro to HPRC Video Tutorial Series
- HPRC's Wiki Page

Thank you. Any questions?

