Distributed Computing With HEP Cloud, GlideinWMS and HTCondor

Distributed Computing with HEP Cloud, GlideinWMS and HTCondor
Marco Mambelli
IF Computing School
June 21, 2021

Outline
– Distributed High Throughput Computing
– Pilot-based systems
– GlideinWMS and HEPCloud
– Storage and credentials
– HTCondor
– Resources and job requirements
6/21/21 Marco Mambelli GlideinWMS introduction

Distributed High Throughput Computing (dHTC)
– Tasks split in small pieces (jobs)
– Resources processing queued jobs
– Run many jobs in parallel to shorten completion
Google Brief Fermilab HEPCloud 6/21/21

Where jobs run
– Your computer
  – Interactive
  – GUI
  – Your customization
  – Your software
– Institutional cluster
  – Batch queue (SLURM, PBS, HTCondor, SGE, ...)
  – Terminal
  – Network access
  – Familiar environment
  – Local support

Where jobs run (2)
– Grid clusters
  – Borrowed resources
  – Network reachable
  – Unknown environment
  – Multi-institution support system
– (Commercial) Cloud
  – Rented resources
  – Virtual machines

Where jobs run (3)
– High Performance Computers (HPC)
  – Each is unique
    – Architecture
    – Network topology
  – Parallel and coupled jobs (MPI)
  – Allocations and long queue times

Pilot jobs (Glideins)
– Separation of tasks
  – Pilot job
    – Test
    – Set up
    – "Expendable"
  – User/real job
    – Science
– Late binding
– Flexible use of multiple resources
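Late binding can be illustrated with a small simulation. This is only a sketch and not GlideinWMS code: `validate_node`, `run_pilot`, and the node dictionaries are hypothetical names made up for the example. The pilot tests the resource first, and only a pilot that passes pulls user jobs from the queue, so a job is never bound to a node that turned out to be broken.

```python
from collections import deque

def validate_node(node):
    """Pilot-side test (hypothetical): is this node usable for science jobs?"""
    return node["has_software"] and node["free_disk_gb"] >= 10

def run_pilot(node, job_queue):
    """Run one pilot on a node and return the jobs it executed.

    The job is chosen only after the pilot has started and validated
    the node (late binding). A failing pilot is expendable: it exits
    without touching the queue.
    """
    if not validate_node(node):
        return []
    executed = []
    for _ in range(node["job_slots"]):
        if not job_queue:
            break
        executed.append(job_queue.popleft())
    return executed

jobs = deque(["job-1", "job-2", "job-3"])
nodes = [
    {"name": "bad-node", "has_software": False, "free_disk_gb": 50, "job_slots": 2},
    {"name": "good-node", "has_software": True, "free_disk_gb": 50, "job_slots": 2},
]
placement = {node["name"]: run_pilot(node, jobs) for node in nodes}
print(placement)
print("still queued:", list(jobs))
```

The failed pilot costs only its own short run; the user jobs stay in the queue until a validated resource asks for them.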

Overlay system
– Pilot layer
  – Distributed computing knowledge and troubleshooting
  – Reduce heterogeneity
  – Handle different speeds
  – Pressure-based submission
– Virtual cluster
  – Domain knowledge and troubleshooting
  – Elastic
– Separation for software, systems and people

GlideinWMS
GlideinWMS is a pilot-based resource provisioning tool for distributed High Throughput Computing
– Provides reliable and uniform HTCondor virtual clusters
– Submits Glideins to unreliable heterogeneous resources
– Distributed architecture
  – Factory
  – Frontend
  – Glidein
[Diagram labels: Job Queue, Frontend, Factory, Workers]

Distributed
– N-to-M relationship
  – Each Frontend can talk to many Factories
  – Each Factory may serve many Frontends
– Multiple User Pools
– High Availability replicas
S. Timm - FNAL-UK Planning Meeting GlideinWMS

HEPCloud Facility
[Diagram labels: User jobs, HEPCloud Facility, Provisioner, Decision Engine, Factory, GCE, Publishers, Glidein submission, Glideins join the overlay]

Decision Engine
[Diagram labels: Sources, Monitoring, Task Manager runs Decision Channel "HPC" (Data Block), Task Manager runs Decision Channel "GCE" (Data Block)]

Storage types summary
– System Volumes
  – Read only
– Locally Mounted Volumes (Local or RAM disk)
  – CWD (Current Work Directory)
  – TMP
– Interactive Storage Volumes (NAS - NFS, GPFS, Lustre, ...)
  – Shared file systems
  – Shared home directories
– Grid-accessible storage volumes
  – Distributed file system (HDFS, dCache, Xrootd)
  – Storage Element
– CernVM FS (CVMFS)
  – Write once, read everywhere HTTP-based distributed FS

Credential types
– X509 Certificate and Proxy
  – VOMS Extension
  – Identity based (you and your affiliations)
– JSON Web Token (JWT)
  – SciToken
  – IDTOKEN
  – WLCG (IAM) token
  – Bearer token (capability based)
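To make the token side more concrete: a JWT is three base64url-encoded parts separated by dots (header, payload, signature), and the payload carries the claims. The sketch below builds an unsigned demo token and reads its claims using only the Python standard library. The claim values (`demo-user`, the `storage.read:/demo` scope, the expiry number) are made up for illustration, and `read_claims` deliberately skips signature verification, so it is suitable for inspection only, never for authorization.

```python
import base64
import json

def b64url_encode(data: bytes) -> str:
    """base64url without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def b64url_decode(text: str) -> bytes:
    """Restore the stripped padding before decoding."""
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)

def read_claims(token: str) -> dict:
    """Return the claims of a JWT WITHOUT verifying its signature."""
    _header_b64, payload_b64, _signature_b64 = token.split(".")
    return json.loads(b64url_decode(payload_b64))

# Build an unsigned demo token; every value here is illustrative.
header = b64url_encode(json.dumps({"alg": "none", "typ": "JWT"}).encode())
payload = b64url_encode(json.dumps({
    "sub": "demo-user",
    "scope": "storage.read:/demo",
    "exp": 1624320000,
}).encode())
demo_token = f"{header}.{payload}."

claims = read_claims(demo_token)
print(claims["sub"], claims["scope"])
```

The `scope` claim is what makes a bearer token capability based: it states what the holder may do, rather than who the holder is.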

Credentials and data movement in a Glidein
– Credentials
  – x509 proxy, SSH keys and JWTs (pilot, cluster, storage, job, ...)
– Data
  – Pilot files
  – Small job files
  – Software and containers
  – Input/output

HTCondor and ClassAds
– HTCondor is a Workload Management System (batch system)
  – Open source, robust, flexible, local (UW Madison)
– HTCondor principles: two parts of the equation
  – Jobs: quanta of work
  – Machines: available resources
– ClassAds is a language for objects (jobs and machines) to
  – Express attributes about themselves
  – Express what they require/desire in a match (similar to personal classified ads)
  – Structure
    – Set of attribute name/value pairs
    – Value: a literal (string, bool, int, float) or an expression

Example Match

Pet Ad
  MyType = "Pet"
  TargetType = "Buyer"
  Requirements = DogLover =?= True
  Rank = 0
  PetType = "Dog"
  Color = "Brown"
  Price = 75
  Breed = "Saint Bernard"
  Size = "Very Large"
  ...

Buyer Ad
  MyType = "Buyer"
  TargetType = "Pet"
  Requirements = (PetType == "Dog") &&
                 (TARGET.Price <= MY.AcctBalance) &&
                 (Size == "Large" || Size == "Very Large")
  Rank = (Breed == "Saint Bernard")
  AcctBalance = 100
  DogLover = True
  ...

Analogy: the dog corresponds to a resource (machine); the buyer corresponds to a job.
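The two-sided nature of the match can be sketched in plain Python. This is an illustration of the idea, not the HTCondor ClassAd evaluator: ads are modeled as dictionaries and each `Requirements` expression as a function of `(my, target)`. The attribute values come from the Pet and Buyer ads above; the comparison operators (`==`, `<=`, `||`, `=?=`) follow standard ClassAd syntax and are an assumption about the original slide. The helper names are hypothetical.

```python
pet = {
    "MyType": "Pet", "TargetType": "Buyer",
    "PetType": "Dog", "Color": "Brown", "Price": 75,
    "Breed": "Saint Bernard", "Size": "Very Large",
}
buyer = {
    "MyType": "Buyer", "TargetType": "Pet",
    "AcctBalance": 100, "DogLover": True,
}

def pet_requirements(my, target):
    # DogLover =?= True: strict comparison, false when the attribute is missing.
    return target.get("DogLover") is True

def buyer_requirements(my, target):
    return (
        target.get("PetType") == "Dog"
        and target.get("Price", float("inf")) <= my["AcctBalance"]
        and target.get("Size") in ("Large", "Very Large")
    )

def buyer_rank(my, target):
    # Rank orders acceptable matches; higher is preferred.
    return 1.0 if target.get("Breed") == "Saint Bernard" else 0.0

def is_match(ad_a, req_a, ad_b, req_b):
    """A match requires BOTH sides' Requirements to be satisfied."""
    return req_a(ad_a, ad_b) and req_b(ad_b, ad_a)

print(is_match(pet, pet_requirements, buyer, buyer_requirements))
print(buyer_rank(buyer, pet))

# A buyer who cannot afford the dog fails the buyer-side requirement.
poor_buyer = dict(buyer, AcctBalance=50)
print(is_match(pet, pet_requirements, poor_buyer, buyer_requirements))
```

In HTCondor terms, the pet plays the machine and the buyer plays the job: the machine can refuse jobs it does not want, and the job can refuse machines that do not meet its needs.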

HTCondor components
[Diagram labels: Central Manager (Negotiator, Collector); Submit Node (Job Repo) with Scheduler and "condor_submit name_of_file"; Execute Node (Machine); File Transfer Mechanism; arrows "Pull list of idle jobs", "Send machine properties (ClassAds)", "Push keepalives"]

HTCondor components (daemons)
[Diagram labels: Central Manager; Submit Node (Job Repo); Execute Node (Machine) with Procd, Starter, Job]

HTCondor building blocks in GlideinWMS
– The Factory works with an HTCondor pool, the WMS pool, to submit Glideins to different resources
– The HTCondor Glideins are pilots that launch a startd that registers on a second HTCondor pool, the User pool
– User jobs are matched and execute on the resources
– The Frontend monitors the user schedds and notifies the Factory about the need for more Glideins
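The Frontend's role, turning queue pressure into requests for more Glideins, can be sketched with a toy rule. This is not the actual GlideinWMS Frontend algorithm: the function name, the one-job-per-Glidein simplification, and the `max_glideins` cap are all assumptions made for the example.

```python
def glideins_to_request(idle_jobs, idle_glideins, busy_glideins, max_glideins=100):
    """Hypothetical pressure rule: how many new Glideins to ask the Factory for.

    Assumes each Glidein runs one job at a time.
    """
    # Idle jobs not already covered by a Glidein that is waiting for work.
    unmet_demand = max(idle_jobs - idle_glideins, 0)
    # Room left under the assumed cap on total Glideins.
    headroom = max(max_glideins - idle_glideins - busy_glideins, 0)
    return min(unmet_demand, headroom)

for idle_jobs in (0, 50, 500):
    request = glideins_to_request(idle_jobs, idle_glideins=5, busy_glideins=10)
    print(f"{idle_jobs} idle jobs -> request {request} Glideins")
```

With no idle jobs nothing is requested, so the virtual cluster shrinks as Glideins finish; with a large backlog the request is limited by the cap rather than by demand.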

Glideins run on Machines
– This is a machine (worker node, host, node, resource), managed by a (Local) Resource Manager
– More frequently virtual than not
– Characterized by its resources (dimensions):
  – CPUs (or total number of cores)
  – RAM (memory)
  – Disk
– There can be other special resources that the node provides: GPUs, access to devices, software, ...
– The Glidein will receive all of the node or part of it
– Sometimes it is not easy to identify everything used by a job
[Diagram labels: Machine, Slot]

Job and Machine 'dimensions'
– Job request
  – request_cpus: number of cores, integer, default 1
  – request_disk: amount of disk space in KiB, defaults to the sum of the sizes of the job's executable and all input files (or image size)
  – request_memory: amount of memory in MiB, defaults to the executable size
– Machine
  – Cpus: number of cores, integer, by default the available cores
  – Disk: amount of disk space on this machine available for the job in KiB, by default the available space
  – Memory: amount of RAM in MiB in this slot
– Over- and under-provisioning are possible
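A quick way to reason about these dimensions is to compare a job's requests with a slot's attributes. The sketch below is a simplification for illustration, not how HTCondor evaluates matches: it assumes requests and slot attributes use the same units (memory in MiB, disk in KiB), and the helper names `fits` and `leftover` and the example sizes are made up.

```python
MIB_PER_GIB = 1024
KIB_PER_GIB = 1024 * 1024

def fits(job, slot):
    """Return True if every requested dimension fits in the slot."""
    return (
        job.get("request_cpus", 1) <= slot["Cpus"]      # cores
        and job["request_memory"] <= slot["Memory"]     # both in MiB
        and job["request_disk"] <= slot["Disk"]         # both in KiB
    )

def leftover(job, slot):
    """Resources left unused if the job runs alone in the slot."""
    return {
        "Cpus": slot["Cpus"] - job.get("request_cpus", 1),
        "Memory": slot["Memory"] - job["request_memory"],
        "Disk": slot["Disk"] - job["request_disk"],
    }

slot = {"Cpus": 8, "Memory": 16 * MIB_PER_GIB, "Disk": 100 * KIB_PER_GIB}
small_job = {"request_cpus": 1,
             "request_memory": 2 * MIB_PER_GIB,
             "request_disk": 10 * KIB_PER_GIB}
big_job = {"request_cpus": 16,
           "request_memory": 4 * MIB_PER_GIB,
           "request_disk": 10 * KIB_PER_GIB}

print(fits(small_job, slot), leftover(small_job, slot))
print(fits(big_job, slot))
```

A large leftover signals an over-provisioned slot for that job, while a job that uses more than it requested is the under-provisioned case; both are reasons to specify requests that reflect what the job really needs.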

Summary
– Your jobs can run on many different resource types
  – Many have specific advantages/limitations
– GlideinWMS and HEPCloud help move jobs around using Glideins
– HTCondor is used in many components
– Test your jobs locally
– Specify all the requirements
