UCLA Health: High-Performance Computing Build-Out


Paul Boutros, Paul Tung, Takafumi Yamaguchi

Pathway: Once upon a time in the snow → Arriving & Planning → Big Ideas → What Happened!

2018: Happily Embedded in Toronto
Education:
- University of Toronto (2014-2016): Executive MBA, Rotman School of Management
- University of Toronto (2004-2008): PhD, Department of Medical Biophysics
- University of Waterloo (2000-2004): B.Sc., Chemistry, Honors Co-operative Education
Toronto was home and my lab was running well, so no excitement was planned for a few years!
- 40-person data science team
- 1 PB of storage
- 5,000-CPU cluster
- Dozens of active research projects
- MTAs/DTAs/IP/etc./etc.

Problem: Localized Disease Outcomes Vary

A Solution: Prognostic & Predictive Biomarkers

My Research: Turning Data Into Insight
1. Early tumour evolution
2. Biomarker development
3. Multi-modal data integration

A Decision: Toronto → Los Angeles
Director, Cancer Data Science – JCCC
- Create a framework to mine UCLA JCCC's Big Data
- HIPAA-compliant compute
- Cutting-edge machine learning
- World leaders in genomics & computational oncology
Associate Director, Cancer Informatics – IPH
- Drive big data innovation from research to clinic and back
- Health-system scale, linked to economics & implementation research

Pathway: Once upon a time in the snow → Arriving & Planning → Big Ideas → What Happened!

Meeting #0: Compute Planning
Before I arrived:
- Started monthly meetings around compute planning (bi-weekly once I arrived, weekly once my team arrived)
- Started & completed capacity estimation
- Identified all data, servers, and code
- Thought about IP – both patents & copyright
Once I arrived:
- Meeting #1: DGIT & OHIA for compute planning!
- Priority #1: team coordination infrastructure
- Priority #2: storage to relocate data assets

Pathway: Once upon a time in the snow → Arriving & Planning → Big Ideas → What Happened!

What Did We Want to Achieve?
Principle: Don't treat this as a one-off; set up the framework for a long-term, consistent tempo of recruiting computational faculty.
Consequence: Need to productize throughout, and take the time to develop a cohort of skilled staff throughout UCLA & UCLA Health.
Principle: Keep a focus on security: in-depth, systematic, designed in.
Consequence: Close interactions with compliance, infrastructure, etc.
Principle: Plan for ongoing technological evolution.
Consequence: Develop a close relationship with technology partners.

Specific Needs
Software Environment:
- Team coordination software: wiki, issue-tracking
- Software management: import 1 million lines of code & history!
Compute Environment:
- HIPAA-compliant
- Scalable to arbitrary storage
- Support for arbitrary compute (GPU, CPU, FPGA, etc.)
- Logging & traceability
- Standardized software across environments
- Lock-down of remote access to specific nodes
- Predictable upgrade tempo (quarterly)
- Real-time monitoring
- Framework for job-tracking, delegation, prioritization
- Burstable to arbitrary size
- Open-source pipelines for standardized analyses
- Productized: easy to reproduce for other uses
- Routine benchmarking & updating of methods
- Cost-conscious
- Etc., etc.

Pathway: Once upon a time in the snow → Arriving & Planning → Big Ideas → What Happened!

Big Ideas to Big Infrastructure

Big Ideas Meets Big Infrastructure: Introduction
Use case: HPC environment for the Boutros Lab
Ideal fit: cloud (Azure)
- Checked off all the required boxes
- Microsoft acquired some of the best companies: Cycle Computing and Avere Systems

What Is an HPC Cluster?
- A set of computing nodes set up to work together to perform complex, demanding tasks
- At a minimum, an HPC cluster contains a scheduler node and compute nodes

Autoscaling
The ability to:
- Scale out to any number of compute nodes required to complete a task
- Scale in when tasks are complete: the cluster is down-sized and compute nodes are de-provisioned
- Save costs, as you're only charged for the time the compute nodes are actually provisioned
CycleCloud works in conjunction with the scheduler to autoscale compute nodes; a sketch of the idea follows.
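To make scale-out and scale-in concrete, below is a minimal Python sketch of the kind of decision logic an autoscaler applies against a Slurm queue. It is not CycleCloud's actual implementation; the MAX_NODES and CORES_PER_NODE values and the squeue-based demand estimate are illustrative assumptions.

```python
# Illustrative autoscaling decision logic (NOT CycleCloud's implementation).
# MAX_NODES and CORES_PER_NODE are hypothetical values chosen for illustration.
import subprocess

MAX_NODES = 300       # assumed upper bound on worker nodes
CORES_PER_NODE = 72   # e.g. 300 nodes x 72 cores = 21,600 CPUs

def pending_cpus() -> int:
    """Sum the CPUs requested by jobs still waiting in the Slurm queue."""
    out = subprocess.run(
        ["squeue", "--states=PD", "--noheader", "--format=%C"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return sum(int(c) for c in out)

def target_node_count() -> int:
    """Scale out to cover pending work; scale in to zero when the queue is empty."""
    needed = -(-pending_cpus() // CORES_PER_NODE)  # ceiling division
    return min(needed, MAX_NODES)
```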

POC – Preparation
- Worked with different departments within UCLA, DGIT, and ISS Security to ensure we had the right resources for the POC
- Microsoft met and worked with us on-site to set up a POC with a CycleCloud server and Avere storage cache in Azure
Initial setup:
- SGE/Slurm scheduler
- Docker enabled
- 10 compute nodes
- Small set of users

HPC Cluster – Behind the scenes

User Experience – Scheduler and Worker Nodes
1. Users log in to the scheduler node via SSH
2. Users submit a job to the cluster (a minimal submission sketch follows)
3. CycleCloud dynamically scales up compute nodes to process jobs, then dynamically scales the compute nodes back down
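As a concrete illustration of step 2, here is a minimal sketch of submitting a batch job to Slurm from the scheduler node; the job name, partition, resource requests, and script path are hypothetical examples, not the platform's required settings.

```python
# Minimal sketch of submitting a job to Slurm after SSH-ing to the scheduler node.
# Job name, partition, resources, and script path are hypothetical examples.
import subprocess

def submit_job(script: str = "align_sample.sh") -> str:
    """Submit a batch script with sbatch and return its confirmation line."""
    result = subprocess.run(
        ["sbatch",
         "--job-name=wgs-align",   # hypothetical job name
         "--partition=F72",        # hypothetical partition mapping to a VM size
         "--cpus-per-task=8",
         "--mem=32G",
         script],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()   # e.g. "Submitted batch job 12345"

if __name__ == "__main__":
    print(submit_job())
```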

Admin Experience – Cluster Management
Cluster management and monitoring:
- CycleCloud admin portal
- CycleCloud CLI
Slurm administration:
- SSH client
Supporting infrastructure (HPC Cache, MariaDB, etc.):
- Azure portal
- Azure CLI
A small Slurm-side monitoring sketch follows.
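As an example of the Slurm side of this, the sketch below summarizes node states using only standard Slurm commands; the summary format is our own and not part of the platform's tooling.

```python
# Sketch of routine Slurm-side monitoring an admin might script over SSH.
# Uses only standard Slurm commands; the summary format is our own.
import subprocess

def node_states() -> dict:
    """Count nodes per Slurm state (idle, allocated, down, ...)."""
    out = subprocess.run(
        ["sinfo", "--noheader", "--format=%T %D"],
        capture_output=True, text=True, check=True,
    ).stdout
    counts = {}
    for line in out.splitlines():
        state, n = line.split()
        counts[state] = counts.get(state, 0) + int(n)
    return counts

if __name__ == "__main__":
    print(node_states())   # e.g. {'idle': 4, 'allocated': 296}
```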

Platform Features
- Choice of VM sizes depending on workload: multiple VM sizes in the same cluster (partitions)
- Integrated ability to use source repos and Docker
- Custom software suites and installations
- Ability to handle scaling to extreme capacities (21,600 cores)
- Multiple storage mounts and tiered storage
- Custom scratch space and RAID arrays for better performance
- Performance-monitoring dashboards
- Stability and efficiency enhancements

Where We're Going in the Future
- A blueprint for the HPC platform: repeatable, scalable, modular
- Scaling to sizes beyond 21,600 cores
- Automating the deployment of the entire infrastructure
- Working on CI/CD processes to create custom images: hyper-streamlined vs. general-purpose

HPC Platform – Blueprint

Perspective
Entity | Prod/Non-Prod Clusters | Total | Cluster Type | Compute & Storage | Date
UCLA Health HPC Clusters | 1/0 | 1 | Cloud | 720 cores, 250 TB storage | February 2020
UCLA Health HPC Clusters | 3/1 | 4 | Cloud | 60,000 cores, 2 PB storage | May 2021
UCLA Hoffman HPC Cluster | 1/0 | 1 | Physical | 21,000 cores, 50 TB storage |
Berkeley Research Computing HPC | 1/0 | 1 | Physical | 15,300 cores, 20 TB storage |

UCLA Health HPC Platform
Use Case | Prod Clusters | Test/Dev Clusters | Description | Compute Needs
Boutros Lab Production | 2 | 1 | Slurm, Azure HPC Cache | 12,000 cores target; spread across two computing clusters; 20% annual expansion (2022 and beyond)
Boutros Lab COVID Use Case | 1 | 0 | Slurm, Azure HPC Cache | 14,400-21,600 cores target; 1 year
IPH Regeneron | 1 | 1 | Slurm, Azure HPC Cache | 5,000-10,000 cores target
MDL HPC | 1 | 0 | Slurm, Azure HPC Cache | 100 cores target
IPH HPC | 1 | 1 | Slurm, Azure HPC Cache | 5,000-10,000 cores target
DGC HPC Analytics Platform | 1 | 1 | Slurm, Azure HPC Cache | 5,000-10,000 cores target

Conclusion
- Always on the cutting edge, pushing the limits of the technology
- Along the way, we found many opportunities to partner with Microsoft to improve service offerings and enhance the user experience and technology
- We will also continue to refine our technology offerings in partnership with our customers

Big Infrastructure to Big Data

Big Data Needs Big Infrastructure
- Our mission is to cure cancer using "Big Data": develop biomarkers and personalized treatment options
- HPC is our key infrastructure: download, process, analyze, use, and store "Big Data" (large-scale, high-throughput molecular datasets)
- "Big Data" gives us efficiency: understand and optimize processes, make high-confidence decisions, and answer important biological questions

Boutros Lab "Big Data"
- 250 TB (February 2020) → 2,000 TB (May 2021)
- Various high-throughput molecular datasets, including: whole genome sequencing (WGS), whole exome sequencing (WXS), RNA-seq (whole transcriptome sequencing), epigenomic and proteomics data
- Cancer data: 300 GB per patient
- COVID-19 data: 100 GB per patient

How Do We Obtain "Big Data"?
- Generate sequencing data: an Illumina NovaSeq 6000 produces 2,000 GB of sequencing data in 2 days!
- Download open/controlled-access datasets to the cluster storage

How Do We Process "Big Data"?
- Nextflow pipelines: scalable and reproducible parallel workflows on clouds and clusters
- Boutros Lab DNA Nextflow pipelines
- Boutros Lab RNA Nextflow pipelines
- The pipelines can process not only cancer datasets but also any DNA-seq/RNA-seq datasets (e.g. COVID-19 patient WGS); a hedged launch sketch follows
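To show how such a pipeline might be driven from the scheduler node, here is a hedged sketch; the pipeline repository name, the Slurm profile, and the --input_csv / --output_dir parameters are hypothetical placeholders rather than the lab's actual pipeline interface.

```python
# Hedged sketch: launching a (hypothetical) Nextflow DNA pipeline for a batch
# of samples. The repository name, profile, and parameters are placeholders,
# not the Boutros Lab pipelines' actual interface.
import subprocess
from pathlib import Path

def run_dna_pipeline(sample_sheet: Path, outdir: Path) -> None:
    """Start one Nextflow run; Nextflow submits the per-sample tasks to Slurm."""
    subprocess.run(
        ["nextflow", "run", "my-org/pipeline-call-germline",  # hypothetical repo
         "-profile", "slurm",                                  # assumed Slurm executor profile
         "--input_csv", str(sample_sheet),                     # hypothetical parameter
         "--output_dir", str(outdir)],                         # hypothetical parameter
        check=True,
    )

if __name__ == "__main__":
    run_dna_pipeline(Path("covid_wgs_samples.csv"), Path("/scratch/results"))
```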

COVID-19 Genetic Predisposition Study

Investigate the Genetic Role in COVID-19 Susceptibility and Severity
- A program with five UC medical centers: UCLA, UCSF, UCSD, UCD, and UCI
- Analyzing WGS data from 702 patients with diverse backgrounds: to identify genetic risk and protection factors linked with symptoms, and to predict risk for infection and develop new treatments
- Integrating the data into two global consortia: COVID Human Genetic Effort; COVID-19 Host Genetics Initiative

COVID-19 WGS Samples
Received 702 WGS samples (~50 TB) from the five UC centers
(Table: FASTQ counts per UC center)

Cluster Scaling
We scaled up the cluster size on an on-demand basis:
- Reserved 300 worker nodes (~21,600 CPUs)
- Adjusted HPC Cache size and throughput
Why scale up the cluster?
- Urgency: to fight the pandemic and process all samples as soon as possible
- Cost efficiency: to reduce the total cost of the whole HPC environment by saving running costs (see the back-of-the-envelope sketch below)
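The cost argument can be made concrete with a back-of-the-envelope sketch: total core-hours are roughly fixed by the workload, while fixed-cost components such as the HPC Cache accrue per wall-clock day, so finishing faster shrinks that term. The core-hour total below is taken from the reported run (21,600 CPUs for about 10 days); the per-core-hour and cache prices are hypothetical placeholders.

```python
# Back-of-the-envelope cost sketch for bursting wide vs. running narrow.
# Core-hours come from the reported run (21,600 CPUs x ~10 days); the
# per-core-hour and HPC Cache prices are hypothetical placeholders.
CORE_HOURS = 21_600 * 24 * 10   # total compute consumed by the 702-sample run
PRICE_PER_CORE_HOUR = 0.05      # hypothetical $/core-hour
CACHE_COST_PER_DAY = 200.0      # hypothetical fixed cost of the HPC Cache per day

def total_cost(cores: int) -> float:
    """Compute cost is roughly constant; fixed costs scale with wall-clock days."""
    days = CORE_HOURS / cores / 24
    return CORE_HOURS * PRICE_PER_CORE_HOUR + days * CACHE_COST_PER_DAY

print(round(total_cost(21_600)))  # wide burst: ~10 days of cache charges
print(round(total_cost(2_000)))   # narrow cluster: ~108 days of cache charges
```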

Cluster Scaling Result
We successfully:
- Deployed 300 worker nodes (21,600 CPUs)
- Processed the 702 WGS samples using the DNA pipelines in 10 days
- Generated germline data for the 702 samples: single nucleotide polymorphisms (SNPs), structural variations (SVs) and copy number variations (CNVs), and mitochondrial SNPs
- Identified features to improve our cluster infrastructure and implemented updates

Remarkable Progress: Lots More Fun to Come!
Software Environment:
- Team coordination software: wiki, issue-tracking
- Software management: import 1 million lines of code & history!
Compute Environment:
- HIPAA-compliant
- Scalable to arbitrary storage
- Support for arbitrary compute (GPU, CPU, FPGA, etc.)
- Logging & traceability
- Standardized software across environments
- Lock-down of remote access to specific nodes
- Predictable upgrade tempo (quarterly)
- Real-time monitoring
- Framework for job-tracking, delegation, prioritization
- Burstable to arbitrary size
- Open-source pipelines for standardized analyses
- Productized: easy to reproduce for other uses
- Routine benchmarking & updating of methods
- Cost-conscious
- Etc., etc.

