A Novel Multi-CPU/GPU Collaborative Computing Framework for SGD-based Matrix Factorization


A Novel Multi-CPU/GPU Collaborative Computing Framework for SGD-based Matrix Factorization
Yizhi Huang*†, Yinyan Long†, Yan Liu*, Shuibing He‡, Yang Bai*, Renfa Li*
50th International Conference on Parallel Processing (ICPP), August 9-12, 2021, Virtual Chicago, IL

Outline
- Background and Motivation
- Design and Implementation
- Evaluation

Background
- Matrix factorization (MF) helps recommender systems predict users' preferences for products: the rating matrix $R$ (users' ratings of items) is factorized into a user feature matrix $P$ and an item feature matrix $Q$, each with feature dimension $k$, so that the predicted rating matrix is $R \approx P^\top Q$.
- SGD-based MF updates the feature matrices $P$ and $Q$ by stochastic gradient descent. For each observed rating $r_{i,j}$, the local loss is
  $\ell(p_i, q_j) = (r_{i,j} - p_i^\top q_j)^2 + \lambda_P \|p_i\|^2 + \lambda_Q \|q_j\|^2$,
  and each iteration applies
  $p_i \leftarrow p_i - \eta \, \partial\ell/\partial p_i$, $q_j \leftarrow q_j - \eta \, \partial\ell/\partial q_j$.
- Each score $r$ is used to update two k-dimensional vectors $p$ and $q$, so the work grows with the number of ratings: we need to accelerate SGD-based MF.
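To make the update rule concrete, here is a minimal NumPy sketch of one SGD pass over the observed ratings (a sketch, not the paper's optimized kernel; the name `sgd_mf_epoch`, the learning rate `lr`, and the regularizers `lam_p`, `lam_q` are illustrative):

```python
import numpy as np

def sgd_mf_epoch(ratings, P, Q, lr=0.005, lam_p=0.02, lam_q=0.02):
    """One SGD epoch over the observed ratings.

    ratings: iterable of (i, j, r) triples from the rating matrix R.
    P: (m, k) user feature matrix; Q: (n, k) item feature matrix.
    Each rating r updates the two k-dimensional vectors P[i] and Q[j].
    """
    for i, j, r in ratings:
        err = r - P[i] @ Q[j]                     # prediction error
        p_old = P[i].copy()                       # keep old p_i for q_j's update
        P[i] += lr * (err * Q[j] - lam_p * P[i])  # p_i <- p_i - lr * dloss/dp_i
        Q[j] += lr * (err * p_old - lam_q * Q[j]) # q_j <- q_j - lr * dloss/dq_j

# Tiny usage example with random data:
rng = np.random.default_rng(0)
m, n, k = 100, 50, 8
P = 0.1 * rng.standard_normal((m, k))
Q = 0.1 * rng.standard_normal((n, k))
obs = [(rng.integers(m), rng.integers(n), rng.uniform(1, 5)) for _ in range(1000)]
for _ in range(20):
    sgd_mf_epoch(obs, P, Q)
```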

Observation: the Under-utilized CPUs
- Many computing nodes have multiple CPUs and GPUs.
- Existing research prefers to manage the GPUs for computing; the CPUs' computing power is easily overlooked.
- Is it possible to cooperate with the CPUs to accelerate SGD-based MF?

Observation
Measured time cost for SGD-based MF versus hardware price:

  Processor(s)                              Time Cost (s)   Price (USD)
  Intel Xeon Gold 6242 (CPU)                5.449           2573
  RTX 2080 (GPU)                            2.21            699
  RTX 2080S (GPU)                           1.93            699
  Tesla V100 (GPU)                          1.499           9000
  6242 + 2080 (CPU/GPU, good collaboration) 1.745           3272
  6242 + 2080S (CPU/GPU, good collaboration)1.592           3272

- The performance of high-end GPUs does not increase linearly with price.
- Cooperative computing of CPU and GPU may bring a good price/performance ratio.

Challenges
An unbalanced load leads to a short-board (bottleneck) effect:

  Configuration                      Time Cost (s)
  6242-2080S (unbalanced data)       4.252        (bad collaboration)
  6242-2080S (bad communication)     2.566        (bad collaboration)
  6242-2080S                         1.592        (good collaboration)
  6242-2080                          1.745        (good collaboration)

- How to uniformly manage and transparently use heterogeneous CPUs and GPUs?
- How to design an appropriate data distribution?
- How to optimize communication between CPUs and GPUs?

With $R_{m \times n} \approx P_{m \times k} Q_{k \times n}$, the naive communication cost is
$(m + n) \cdot k \cdot \mathrm{sizeof(float)} \cdot \mathit{Iterations} / B_{bus}$.
For Netflix: $m = 480{,}190$, $n = 17{,}771$, $k = 128$, 20 iterations, cost $\approx 0.4$ s.
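As a sanity check on that estimate, a few lines of Python reproduce the 0.4 s figure (the bus bandwidth value is an assumption, roughly the effective rate of PCIe 3.0 x16):

```python
m, n, k = 480_190, 17_771, 128
iterations, sizeof_float = 20, 4     # FP32
bus_bandwidth = 12e9                 # bytes/s, assumed ~PCIe 3.0 x16 effective

bytes_moved = (m + n) * k * sizeof_float * iterations
print(bytes_moved / bus_bandwidth)   # ~0.42 s
```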

Outline
- Background and Motivation
- Design and Implementation
- Evaluation

Our Solution: HCC-MF
- Problem 1: How to make heterogeneous CPUs and GPUs transparent to the user? -> A general framework that unifies the abstraction and workflow.
- Problem 2: How to distribute data to each heterogeneous CPU/GPU so that the whole system is more efficient? -> A time cost model for guiding data distribution, plus two data partition strategies for different synchronization overhead conditions.
- Problem 3: How to optimize communication between CPUs and GPUs? -> Communication optimization strategies that reduce the amount of data transmitted and use computation to overlap communication.

HCC-MF
- Heterogeneous CPUs/GPUs are abstracted into worker processes, managed by a server process.
- Shared memory is used as the communication (COMM) channel between processes, with per-worker push and pull buffers.
- The rating matrix is split into row grids; the server's data manager assigns grids to workers according to the time cost model and the data partition strategy, and the workers asynchronously run SGD-based MF (ASGD) on their grids.
- Workers: pull the latest feature data, compute, then push their updates.
- Server: synchronizes the shared item feature matrix across the $p$ workers, e.g. $Q = \sum_{i=1}^{p} Q_i / p$.
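The pull-compute-push loop can be sketched as follows (a minimal single-machine sketch with Python threads and a lock standing in for the shared-memory COMM channel; `Server`, `worker_step`, and the averaging rule are illustrative, not HCC-MF's actual API):

```python
import threading
import numpy as np

class Server:
    """Holds the shared item matrix Q; workers pull and push replicas."""
    def __init__(self, Q):
        self.Q, self.lock, self.pushed = Q, threading.Lock(), []

    def pull(self):
        with self.lock:
            return self.Q.copy()

    def push(self, Q_local):
        with self.lock:
            self.pushed.append(Q_local)

    def sync(self):
        # Synchronization step: average the workers' replicas, Q = sum(Q_i)/p
        with self.lock:
            self.Q = sum(self.pushed) / len(self.pushed)
            self.pushed = []

def worker_step(server, P_rows, grid, lr=0.005, lam=0.02):
    """One pull-compute-push round on this worker's row grid of R."""
    Q = server.pull()                        # 1. pull the latest Q
    for i, j, r in grid:                     # 2. ASGD over the local ratings
        err = r - P_rows[i] @ Q[j]
        p_old = P_rows[i].copy()
        P_rows[i] += lr * (err * Q[j] - lam * P_rows[i])
        Q[j] += lr * (err * p_old - lam * Q[j])
    server.push(Q)                           # 3. push the updated replica
```

In the real framework each worker is a separate process bound to one CPU or GPU, and the push/pull buffers live in shared memory rather than in a Python object.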


Time Cost Model
$T = \max_i(T_i) + T_{sync}$
- $T_i$ is worker i's per-iteration time (pull + computing + push); lower-order terms are omitted, and workers of the same type have similar $T_i$.
- Can sync be ignored?
[Timeline: the server runs pull-push synchronization while workers 0-4 each pull, compute, and push.]
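A hedged sketch of this cost model as code (the per-rating constant of $16k+4$ bytes and the bandwidth terms follow the partition slides below; `comp_bw` and `bus_bw` are assumed effective bandwidths, and all names are illustrative):

```python
def worker_time(x_i, nnz, k, m, n, comp_bw, bus_bw, sizeof=4):
    """Per-iteration time of worker i under the model T_i = compute + transfer.

    x_i: fraction of the nnz ratings assigned to worker i.
    compute: each rating touches ~(16k + 4) bytes (read/write the two
             k-dim feature vectors in FP32, plus the rating itself).
    transfer: pulling and pushing the k-dim feature rows over the bus.
    """
    compute = x_i * nnz * (16 * k + 4) / comp_bw
    transfer = 2 * k * (m + n) * sizeof / bus_bw
    return compute + transfer

def total_time(worker_times, t_sync):
    # T = max_i(T_i) + T_sync
    return max(worker_times) + t_sync
```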

Data Partition for Load Balance
Model worker i's per-iteration time as a linear function of its data share $x_i$:
$T_i(x_i) = x_i \cdot nnz \cdot (16k+4)/B_i + 2k(m+n)/B_{bus,i} = a_i x_i + b_i$.
The partition objective is to minimize the slowest worker:
$\theta(x) = \min \max_i (a_i x_i + b_i)$.
- Assuming $B_i$ is a constant function of $x_i$, $\theta$ reaches its minimum when $a_1 x_1 + b_1 = a_2 x_2 + b_2 = \dots = a_p x_p + b_p$.
- DP0 solves this system in closed form, giving each worker a share roughly inversely proportional to its estimated per-unit time $a_i$ (see the sketch below).
- Can DP0 really guarantee load balance?
[Timeline: under the model, the computing phases of workers 0-4 finish together.]
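A small sketch of the DP0 idea: solve $a_1 x_1 + b_1 = \dots = a_p x_p + b_p$ subject to $\sum_i x_i = 1$ (the helper `dp0_partition` is hypothetical; the closed form follows from those two constraints, not from the paper's exact notation):

```python
def dp0_partition(a, b):
    """Split data so all workers finish together under T_i = a_i*x_i + b_i.

    Setting a_i*x_i + b_i = theta for all i and sum(x_i) = 1 gives
    theta = (1 + sum(b_j/a_j)) / sum(1/a_j),  x_i = (theta - b_i) / a_i.
    """
    inv = [1 / ai for ai in a]
    theta = (1 + sum(bi / ai for ai, bi in zip(a, b))) / sum(inv)
    return [(theta - bi) / ai for ai, bi in zip(a, b)]

# Example: the fast worker (small a_i) gets most of the ratings.
print(dp0_partition(a=[1.0, 4.0], b=[0.1, 0.1]))  # -> [0.8, 0.2]
```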

Data Partition for Load Balance (cont.)
In practice, DP0 is not balanced:
- The assumption that $B_i$ is constant in $x_i$ is not true.
- The runtime performance may not be ignored.
However, if the change in $x$ is small, $T$ can still be regarded as locally linear. DP1 (Algorithm 1) therefore starts from DP0's assignment and differentially refines it over a few profiling iterations, re-fitting the linear model from measured times until the workers finish together (see the sketch below).
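One way to realize that refinement loop (a sketch under the stated local-linearity assumption; `measure_times` is a hypothetical profiling hook, and the damping factor is illustrative rather than Algorithm 1's exact update rule):

```python
def dp1_refine(x, measure_times, rounds=5, damping=0.5):
    """Differentially rebalance shares x from measured per-worker times.

    measure_times(x) runs one (or a few) profiling iterations and returns
    the list of measured T_i. While the perturbation stays small, T_i is
    roughly linear in x_i, so shares are nudged toward equal finish times.
    """
    for _ in range(rounds):
        t = measure_times(x)
        target = sum(t) / len(t)              # everyone should hit the mean
        x = [xi * (1 + damping * (target - ti) / ti) for xi, ti in zip(x, t)]
        s = sum(x)
        x = [xi / s for xi in x]              # renormalize to a full split
    return x
```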

Data Partition: Hiding Synchronization
With synchronization included, the total time becomes
$T = \max_i \left( x_i \cdot nnz \cdot (16k+4)/B_i + 2k(m+n)/B_{bus,i} \right) + 3tk(m+n)/B_{server}$,
where $t$ is a nonlinear function of $x$, which makes the objective function difficult to solve directly. Instead:
- Use DP1 to balance the computational overhead of each worker: $T_1 = T_2 = \dots = T_p$.
- Use computation to hide the synchronization overhead: DP2 skews DP1's partition so that the workers' computation overlaps the server's synchronization.
[Timeline: with DP2, the server's synchronization phases run concurrently with workers 0-4 computing.]


Reduce Data Transmission
- Rows (columns) of the rating matrix are independent of each other, so each worker keeps its own rows of the user matrix P locally and only the item matrix Q needs to be transmitted.
- The data range of the rating matrix is limited, so Q can be transmitted as FP16 data (half precision) instead of FP32.
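A minimal sketch of the FP16 trick with NumPy (with bounded ratings such as 1-5, the feature values stay in a range where half precision is tolerable on the wire; the round trip shown is illustrative, not the framework's buffer code):

```python
import numpy as np

# Item matrix Q at Netflix-like size: n = 17,771 items, k = 128 features.
Q = np.random.default_rng(0).standard_normal((17_771, 128)).astype(np.float32)

wire = Q.astype(np.float16)           # half the bytes on the bus
print(Q.nbytes, wire.nbytes)          # 9,098,752 vs 4,549,376 bytes

Q_received = wire.astype(np.float32)  # widen back before computing
print(np.abs(Q - Q_received).max())   # small quantization error (~1e-3)
```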

Overlap Communication
Use multiple asynchronous computing-transmission streams in each worker:
- GPU: the copy engine transfers one data grid while the cores compute on another.
- CPU: spare threads and free memory bandwidth handle transfers alongside computation.
- SoC: the copy engine in the iGPU.
A sketch of the pipelining idea follows below.
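The same pipelining idea in portable form (a Python sketch using a background thread as a stand-in for a GPU copy engine; on a real GPU this role is played by asynchronous copies on a separate stream):

```python
import threading
import queue

def pipeline(grids, transfer, compute):
    """Overlap communication and computation across data grids.

    While grid g is being computed, grid g+1 is already in flight:
    a background thread plays the role of the GPU's copy engine.
    """
    ready = queue.Queue(maxsize=1)    # one grid in flight at a time

    def copy_engine():
        for g in grids:
            ready.put(transfer(g))    # async "memcpy" of the next grid
        ready.put(None)               # end-of-stream marker

    threading.Thread(target=copy_engine, daemon=True).start()
    while (data := ready.get()) is not None:
        compute(data)                 # overlaps with the next transfer

# Usage sketch (load_grid and run_sgd are hypothetical callbacks):
# pipeline(range(8), transfer=load_grid, compute=run_sgd)
```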

Outline
- Background and Motivation
- Design and Implementation
- Evaluation

Evaluation Setup

  Item       Content
  Hardware   2x Intel Xeon Gold 6242, Nvidia RTX 2080S, Nvidia RTX 2080
  Dataset    Netflix; Yahoo! Music R1, R2, R1*; MovieLens-20M
  Baseline   FPSGD and cuMF_SGD, as implemented by us

- We do not change the core idea of the baseline algorithms in our implementation.
- We optimized the code to make the baselines execute faster.
- We use the baselines as the kernels running on the workers.

Overall Performance
- Same convergence rate as the baselines, with faster training speed.
[Figure: convergence curves over training time on Netflix, R1, and R2.]

Data Partition Evaluation
- DP0 can only guarantee load balancing on similar processors.
- DP1 guarantees load balance on all processors: -12.2% training time on Netflix with 4 workers, -10% on R2 with 4 workers.
- DP2 can hide the synchronization overhead: -12.1% on R1* with 4 workers.

Communication Optimization
- Without any communication optimization, the communication overhead offsets the benefits brought by parallelism.
- Transmitting only Q achieves better results, but its effectiveness depends on the shape of the rating matrix.
- The transmission performance of half-q (FP16 Q) is more than twice that of full-precision Q.

Conclusion
HCC-MF: a heterogeneous multi-CPU/GPU collaborative computing framework for SGD-based matrix factorization.
- A unified workflow with transparent use of heterogeneous CPUs/GPUs.
- Data distribution algorithms for different synchronization conditions.
- Optimized communication between CPUs and GPUs.
Limitations (under study):
- Communication overhead can be further optimized.
- The server can become a bottleneck.

Thank You
Yizhi Huang
huangyizhi@hnu.edu.cn
