Towards Parallel Access of Multi-dimensional, Multi-resolution Scientific Data

Sidharth Kumar, Valerio Pascucci
SCI Institute, University of Utah, Salt Lake City, Utah

Venkatram Vishwanath, Philip Carns, Robert Latham, Tom Peterka, Michael Papka, Robert Ross
Argonne National Laboratory, Argonne, Illinois

Abstract— Large scale scientific simulations routinely produce data of increasing resolution. Analyzing this data is key to scientific discovery. A critical bottleneck facing the analysis is the I/O time to access the data. One method of addressing this problem is to reorganize the data in a manner that simplifies analysis and visualization. The IDX file format is an example of this approach. It orders data points so that they can be accessed at multiple resolution levels with favorable spatial locality and caching properties. IDX has been used successfully in fields such as digital photography and visualization of large scientific data, and is a promising approach for analysis of HPC data. Unfortunately, the existing tools for writing data in this format only provide a serial interface. HPC applications must therefore either write all data from a single process or convert existing data as a post-processing step, in either case failing to utilize available parallel I/O resources.

In this work, we provide an overview of the IDX file format and the existing ViSUS library that provides serial access to IDX data. We investigate methods for writing IDX data in parallel and demonstrate that it is possible for HPC applications to write data directly into IDX format with scalable performance. Our preliminary results demonstrate 60% of the peak I/O throughput when reorganizing and writing the data from 512 processes on an IBM BG/P system. We also analyze the remaining bottlenecks and propose future work towards a more flexible and efficient implementation.

Keywords—parallel I/O, multi-dimensional data

I. INTRODUCTION

The increase in computational power of supercomputers is enabling unprecedented opportunities to advance science in numerous fields such as climate science, astrophysics, cosmology and material science. These simulations routinely produce larger quantities of raw data. A key requirement is to analyze this data and transform it into useful insight. A critical bottleneck faced by analysis applications is the I/O time to read and write data to storage.

IDX provides efficient, cache oblivious, and progressive access to large-scale scientific data by storing the data in a hierarchical Z order [1]. It makes it possible for scientists to interactively analyze and visualize data on the order of several terabytes [2]. IDX has been used successfully in fields such as digital photography [3] and visualization of large scientific data [2], and is promising for analysis of HPC data as well [4].

ViSUS, an IDX API, is serial in nature, which limits the use of IDX to relatively small scale datasets. To overcome this problem, we have developed a parallel API to transform large scale scientific data to IDX format. It utilizes the computation resources of each compute node to efficiently calculate the HZ ordering. It then coordinates file system access using collective communication to write the data set in parallel.

Development of the parallel IDX API is the culmination of observations and experiments made over different versions of a prototype API. We began by evaluating the use of the existing ViSUS library in a parallel environment. We then constructed a prototype API, called PIDX, that allows data to be written in parallel.
By analyzing this prototype, we were able to identify and address inefficiencies in both the I/O strategy and the data order computation. The prototype API demonstrates that it is possible for HPC applications to write data directly into IDX format with scalable performance. We will use this prototype as a platform for future work in developing a more flexible and efficient implementation.

The remainder of this paper is organized as follows: we present relevant background information on the IDX data format in Section 2 and describe ViSUS, a serial IDX API, in Section 3. We present our work on writing IDX data in parallel in Section 4 and discuss performance optimizations in Section 5. We evaluate the performance of our parallel IDX prototype in Section 6, and finally we conclude and discuss our plans for further research.

II. IDX DATA FORMAT

IDX enables fast and efficient access to large scale scientific data. In IDX, data is organized into multiple levels of resolution, making it easy to query data of any desired size and dimension. Figure 1 depicts screenshots of a visualization tool based on IDX being used to visualize a 530 MB IDX data set of a rat's retina scan. Figure 1(a) corresponds to visualizing data at the lowest resolution; this case requires querying a very small set of data. Figure 1(d) corresponds to visualizing data at higher resolution, which requires querying at multiple resolutions for a clipped viewing area. Figure 1(b) is the case where the data visualized in (c) is zoomed with a progressive increase in resolution, whereas (c) corresponds to zooming without any progressive increase of detail, producing holes in the image. Figures 1(e) and (f) are zoomed cross-sections from (b) and (c), in which the holes are clearly visible.

Figure 1. (a) lowest resolution data; (b) progressively zoomed data at high resolution; (c) zoomed data at low resolution; (d) data at high resolution; (e) zoomed cross-section from (b), hence high resolution; (f) zoomed cross-section from (c), hence low resolution with equally spaced holes
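To make the resolution levels concrete, the sketch below computes how many samples each HZ level holds, using the per-level counts described later in Section V (levels 0 and 1 hold one sample each, and every subsequent level doubles the previous one); the function name samples_at_level is our own illustration, not part of ViSUS or PIDX.

```c
#include <stdio.h>
#include <stdint.h>

/* Samples stored at a given HZ level: levels 0 and 1 hold a single
 * sample each; every later level doubles the one before it. */
static uint64_t samples_at_level(int level)
{
    return (level <= 1) ? 1ULL : 1ULL << (level - 1);
}

int main(void)
{
    /* A 2x2x2 volume (8 samples) spans levels 0 through 3:
     * 1 + 1 + 2 + 4 = 8 samples in total. */
    for (int level = 0; level <= 3; level++)
        printf("level %d: %llu sample(s)\n", level,
               (unsigned long long)samples_at_level(level));
    return 0;
}
```

Because the topmost level alone holds half of all samples, a query at a coarse level touches exponentially less data than a full-resolution read.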

Hierarchical Z (HZ) ordering is the key idea behind the IDX data format. IDX supports multi-dimensional data of arbitrary dimensions and sizes. HZ order computation requires the spatial coordinates of data samples; for instance, it requires the x, y and z coordinates for a three dimensional data set. The exact formulation of HZ ordering can be found in [1]. Data is then reorganized into levels according to the following formulation:

Level = floor(log2(HZ index)) + 1

These levels correspond to the different resolutions into which the data is rearranged. Level-wise data is then stored in IDX format data files. From a file structure point of view, the IDX format has an .idx file that holds all the required metadata (dimensions, sample type, variable names, and more). The raw data is stored in a hierarchical level of binary files. The number of files and the size of the files are configurable.

TABLE I. CONVERSION FROM X, Y, Z COORDINATES TO HZ ORDER

Table I demonstrates the conversion for a simple 2x2x2 volume of data. The conversion of data to IDX format can be considered as converting n-dimensional data to one dimension. This conversion to HZ ordering is a bijective function, which is a required condition for parallelization. As a result of HZ ordering and the corresponding distribution of data into different levels of resolution, it becomes increasingly fast to query just the required data set for analysis and visualization. For instance, there is little lag when zooming or panning large-scale data at any desired rate. This is extremely critical for interactive visualization and analysis of data.

III. VISUS: A SERIAL IDX WRITER

The experiments in this paper were conducted on the Surveyor IBM Blue Gene/P (BG/P) system at the Argonne Leadership Computing Facility (ALCF) at Argonne National Laboratory. Surveyor is a 4,096-core research and development system. Its storage subsystem consists of four file servers running PVFS and a DataDirect Networks S2A9550 SAN.

The first goal in our effort to utilize IDX in this HPC environment was to develop a parallel application that would use ViSUS I/O to write directly into IDX format. We developed a microbenchmark that divides an entire data volume into smaller 3D chunks, which each process independently writes to an IDX data set. MPI barriers and tokens are used to maintain order among processes; a process with rank r can write to an IDX file only after the process with rank r-1 has finished writing. The processes cannot write concurrently due to conflicts in updating metadata and block layouts.
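The token-passing scheme just described can be sketched as follows; visus_write_chunk() is a hypothetical placeholder for the serial ViSUS call that writes one process's 3D chunk, since the actual API is not shown here.

```c
#include <mpi.h>

/* Minimal sketch of the serialized writer: rank r may write only
 * after rank r-1 has finished, enforced by passing a token. */
void serialized_idx_write(MPI_Comm comm)
{
    int rank, nprocs, token = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    if (rank > 0)   /* block until the previous rank hands over the token */
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, comm, MPI_STATUS_IGNORE);

    /* visus_write_chunk(rank);  -- serial ViSUS write of this chunk */

    if (rank < nprocs - 1)   /* pass the token to the next rank */
        MPI_Send(&token, 1, MPI_INT, rank + 1, 0, comm);

    MPI_Barrier(comm);   /* keep all processes in step, as in the benchmark */
}
```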

We used MPE and Jumpshot [5] to understand the I/O patterns of the ViSUS microbenchmark. Figure 2 depicts the Jumpshot profile of 64 processes writing an IDX file using ViSUS. Each horizontal line represents a process, where the yellow regions correspond to MPI barrier waits and the black regions correspond to time spent writing data. As expected, we notice that a large portion of the runtime for each process is spent waiting for the I/O token.

Figure 2. Jumpshot image for the serialized ViSUS IDX writer using 64 nodes and a 64 MiB data set

The aggregate bandwidth for this benchmark on 64 processes as we increase the amount of data is illustrated in Figure 3. The efficiency improves as the aggregate volume to write is increased to 8 GiB, but its maximum performance is only 9.5 MiB/s. Using IOR, a widely adopted benchmark for parallel filesystems, we obtain a peak performance of 218 MiB/s for 64 processes writing a total of 8 GiB. Thus, we are able to achieve only 4% of the maximum throughput. This is expected, as ViSUS is serial in nature and the various processes take turns to write the data out.

Figure 3. Performance of the serialized ViSUS IDX writer using 64 processes as the volume of data is varied. The ViSUS IDX writer is able to achieve only 4% of the throughput achieved by IOR

IV. PIDX: PROTOTYPE API FOR PARALLEL IDX WRITES

Based on our experience with the serialized ViSUS writer, we then developed a prototype API for performing concurrent I/O to an IDX data set. This API is called Parallel IDX (PIDX) and includes functions patterned after ViSUS for creating, opening, reading, and writing IDX data sets. Each of the PIDX functions is a collective operation that accepts an MPI communicator as an argument.

In both the ViSUS and PIDX APIs, the dimensions and maximum volume of the data set are defined when the file is created. We therefore use the collective create function as an opportunity to pre-create all of the metadata file, subdirectories, and (initially empty) binary files that constitute the IDX data set. The rank 0 process is responsible for populating the metadata file and directory hierarchy. The work of creating the empty binary files is then distributed across all processes.

Once the data set is created, the PIDX write function can be used to collectively write arbitrary sub-volumes of data from each process. The data in this case is provided in the form of a contiguous, row-major ordered data buffer. Each process must calculate an HZ ordering for this sub-volume, reorder the data points accordingly, and write those data points to interleaved portions of the IDX data set. For prototype purposes, the PIDX library simply copies the sub-volume into an intermediate buffer when reordering. It also generates an index into that buffer indicating each level of the HZ hierarchy. Each level is then written in turn to the IDX data set using independent MPI I/O write operations. Data within a single level is typically contiguous in the file.
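As an illustration of the reordering step, the sketch below computes a plain Z-order (Morton) index by bit interleaving and assigns each sample to a resolution level with the formula from Section II. The true HZ index is a further transform of this ordering, for which we refer to [1]; morton3 and hz_level are our own illustrative names, not PIDX functions.

```c
#include <stdio.h>
#include <stdint.h>

/* Interleave the bits of (x, y, z) into a Z-order (Morton) index. */
static uint64_t morton3(uint32_t x, uint32_t y, uint32_t z)
{
    uint64_t m = 0;
    for (int b = 0; b < 21; b++) {       /* 21 bits per axis -> 63 bits */
        m |= ((uint64_t)((x >> b) & 1)) << (3 * b);
        m |= ((uint64_t)((y >> b) & 1)) << (3 * b + 1);
        m |= ((uint64_t)((z >> b) & 1)) << (3 * b + 2);
    }
    return m;
}

/* Level = floor(log2(index)) + 1, with index 0 mapped to level 0. */
static int hz_level(uint64_t index)
{
    int level = 0;
    while (index) { level++; index >>= 1; }
    return level;
}

int main(void)
{
    /* In the spirit of Table I: index and level for a 2x2x2 volume. */
    for (uint32_t z = 0; z < 2; z++)
        for (uint32_t y = 0; y < 2; y++)
            for (uint32_t x = 0; x < 2; x++)
                printf("(%u,%u,%u) -> index %llu, level %d\n", x, y, z,
                       (unsigned long long)morton3(x, y, z),
                       hz_level(morton3(x, y, z)));
    return 0;
}
```

Once every sample carries a level, the buffer can be grouped level by level and each level written with an independent MPI I/O operation, as described above.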

V. OPTIMIZATION STRATEGIES

The PIDX prototype described in the previous section greatly improved I/O performance over serial use of the ViSUS library. Jumpshot analysis revealed a number of inefficiencies, however. Figure 4 illustrates one example. This view highlights the time spent by rank 0 when writing data into the four initial HZ levels. The pink regions represent file open time. There is an initial expensive file open corresponding to creation of a binary file at PIDX create time, which cannot be avoided. However, it is also evident that a significant amount of time is spent in a sequence of four subsequent file open operations. This is because for each HZ level, PIDX identifies the appropriate binary file, opens it, writes a contiguous set of samples, and closes the file. The first two HZ levels contain only a single data point each, the third level contains 2 data points, the fourth level contains 4 data points, and the number of data points doubles with every successive level. In the initial HZ levels, the I/O cost was therefore dominated by time spent opening the file.

Figure 4. Jumpshot profile of a process writing an IDX file without file descriptor caching. Pink depicts the file open time, and the last three pink columns depict the redundant file opens being performed

In order to mitigate this overhead, we implemented a file handle caching mechanism in our prototype. When any process opens an underlying binary file, it holds the MPI file descriptor open for future use and does not close it until all I/O is complete. The result of this optimization is shown in Figure 5 for the same data set. There is now only one file open operation in the main write path, and performance is improved accordingly.

Figure 5. Jumpshot profile of a process writing an IDX file with file descriptor caching. The redundant file opens are eliminated via file descriptor caching

Figure 6 depicts the performance improvement achieved with file handle caching over the default implementation for 64 cores as we increase the total data volume. We notice a significant improvement of up to 7-fold for data volumes of 128 MiB. However, we notice only a marginal improvement with higher data volumes. This is because a significant amount of the I/O time was spent in the computation to generate the HZ ordering. We performed a detailed analysis of the HZ computation using the IBM BG/P universal performance counters and identified bottlenecks associated with redundant computations as well as inadequate use of the floating point double hummers. By overcoming these, as depicted in Figure 6, we are able to achieve up to a 75% improvement in I/O throughput over the file handle improvements and up to a 10-fold improvement over the default implementation.

Figure 6. The achievable PIDX throughput on 64 cores as we vary the total data volume written with the various optimizations. File caching and HZ computation optimizations yield significant improvements in performance

Figure 7 shows the Jumpshot visualization of the PIDX microbenchmark when writing an IDX data set. The PIDX regions correspond to computation time, the pink regions correspond to file open time, and the purple regions correspond to I/O time. In contrast to Figure 2, which showed the serialized ViSUS writer, we are now able to efficiently utilize all processes when writing.

Figure 7. Jumpshot image for the Parallel IDX (PIDX) writer using 64 nodes and a 16 MiB data set
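A minimal sketch of the file handle caching described in this section, assuming a small fixed-size table and a linear lookup (the real PIDX implementation is not shown in this paper); an independent open, as PIDX performs, would pass MPI_COMM_SELF as the communicator.

```c
#include <mpi.h>
#include <string.h>

#define MAX_CACHED 64   /* illustrative bound on simultaneously open files */

static struct { char name[256]; MPI_File fh; int used; } cache[MAX_CACHED];

/* Return a cached handle if the binary file is already open;
 * otherwise open it once and keep the handle for later HZ levels. */
MPI_File cached_open(MPI_Comm comm, const char *name)
{
    MPI_File fh;
    for (int i = 0; i < MAX_CACHED; i++)
        if (cache[i].used && strcmp(cache[i].name, name) == 0)
            return cache[i].fh;               /* hit: no redundant open */

    for (int i = 0; i < MAX_CACHED; i++)
        if (!cache[i].used) {
            MPI_File_open(comm, name, MPI_MODE_WRONLY, MPI_INFO_NULL,
                          &cache[i].fh);
            strncpy(cache[i].name, name, sizeof cache[i].name - 1);
            cache[i].used = 1;
            return cache[i].fh;
        }

    /* Cache full: fall back to an uncached open. */
    MPI_File_open(comm, name, MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    return fh;
}

/* Close every cached handle once all I/O has completed. */
void cache_close_all(void)
{
    for (int i = 0; i < MAX_CACHED; i++)
        if (cache[i].used) {
            MPI_File_close(&cache[i].fh);
            cache[i].used = 0;
        }
}
```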
VI. PERFORMANCE ANALYSIS OF PARALLEL IDX WRITER

In this section we evaluate the scalability of the PIDX prototype, both in terms of data volume and number of processes. We begin by investigating the weak scaling behavior of the PIDX library on Surveyor. For weak scaling, we used a constant load of 128 MB per process. The total load was then increased linearly as the number of processes was varied from 1 to 512, and the results are shown in Figure 8. From these results we see that a single process achieves approximately 6.85 MiB/s, comparable to the speed achieved by a serial writer for an equal volume of data. The peak aggregate performance of 406 MiB/s is reached with 512 processes. This is approximately 60% of the peak IOR throughput achievable (667 MiB/s) on 512 cores of Surveyor.

Figure 8. PIDX write throughput with weak scaling
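Throughput figures of this kind can be obtained with a timing harness of the following shape; pidx_collective_write() is a hypothetical placeholder for the PIDX write call, and this harness is our illustration rather than the instrumentation actually used in these experiments.

```c
#include <mpi.h>
#include <stddef.h>

/* Sketch: time a collective write and report aggregate MiB/s,
 * using the slowest process to bound the effective bandwidth. */
double measure_write_mibs(MPI_Comm comm, size_t bytes_per_proc)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    MPI_Barrier(comm);                       /* start all ranks together */
    double t0 = MPI_Wtime();
    /* pidx_collective_write(...);  -- the write being measured */
    double elapsed = MPI_Wtime() - t0;

    double slowest = elapsed;
    MPI_Reduce(&elapsed, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, comm);

    double total_mib = (double)bytes_per_proc * nprocs / (1024.0 * 1024.0);
    return total_mib / slowest;              /* meaningful on rank 0 */
}
```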

Table II depicts the results of an experiment investigating strong scaling properties. In this case, the total data volume written was fixed at 8 GiB and we varied the number of processes writing from 8 to 512. The Surveyor compute nodes possess only 2 GiB of RAM per 4 cores, so we were unable to evaluate smaller examples. As depicted in the table, the maximum performance is reached when using 512 processes.

TABLE II. PIDX STRONG SCALING FOR A TOTAL DATA VOLUME OF 8 GIB

Number of Processes | PIDX Throughput in MiB/s
64                  | 120.3
512                 | 143.9

In both the weak and strong scaling examples, we found that we hit a limit on scalability with our current implementation, falling short of the peak Surveyor write performance achieved by IOR. To investigate this issue further, we instrumented the time required to write each level of the HZ ordering. Table III depicts the time taken to write the various levels in the HZ hierarchy for an 8 GiB total volume on 64 nodes. An 8 GiB data volume consists of 30 HZ levels. We found that contention and metadata overhead caused levels 0 through 6 to take a disproportionate amount of time relative to the amount of data that they were writing.

TABLE III. TIME TAKEN TO WRITE THE VARIOUS IDX LEVELS FOR AN 8 GIB DATA VOLUME ON 64 PROCESSES

We plan to leverage aggregation strategies to better coordinate I/O in the first few cases where many processes contribute data to the same level of the IDX data set.
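One possible shape for such an aggregation strategy is sketched below: contributions to the small, contended levels are first gathered onto a single aggregator rank, which then issues one write in place of many tiny ones. The fixed offset, the use of rank 0 as the sole aggregator, and write_small_levels itself are illustrative simplifications, not the planned PIDX design.

```c
#include <mpi.h>
#include <stdlib.h>

/* Two-phase sketch: gather each rank's samples for a low HZ level,
 * then write them with a single contiguous I/O operation. */
void write_small_levels(MPI_Comm comm, MPI_File fh,
                        const double *mine, int count)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    double *all = NULL;
    if (rank == 0)
        all = malloc((size_t)count * nprocs * sizeof *all);

    /* Phase 1: collect every process's contribution on the aggregator. */
    MPI_Gather(mine, count, MPI_DOUBLE, all, count, MPI_DOUBLE, 0, comm);

    /* Phase 2: one write instead of nprocs small, contended writes.
     * Offset 0 stands in for the level's real offset in the file. */
    if (rank == 0) {
        MPI_File_write_at(fh, 0, all, count * nprocs, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
        free(all);
    }
}
```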
VII. CONCLUSIONS AND FUTURE WORK

In this work we have shown that the IDX file format is a promising technology for analysis in scientific computing. We have also demonstrated that it is possible for simulation data to be written directly into IDX format via a parallel API that provides similar functionality to the ViSUS implementation but with an order of magnitude improvement in performance. We have also identified multiple optimizations that can be used to improve performance on the BG/P platform.

As noted in the previous section, we believe that further advances in PIDX efficiency can be achieved through the use of aggregation strategies that limit contention at the file system level. However, our current API is not ideal for this purpose. In future work we plan to revise the API in a manner that decouples the data model from the I/O mechanism. In particular, the application will describe the data set in its entirety (including support for multiple variables and discontiguous memory) up front before writing the data set. This will allow the PIDX library to take as much information as possible into account in order to schedule an efficient transfer mechanism. We will investigate multiple aggregation strategies using this platform.

ACKNOWLEDGMENT

This work was supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Dept. of Energy, under Contract DE-AC02-06CH11357. This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357.

REFERENCES

[1] V. Pascucci and R. J. Frank. Hierarchical indexing for out-of-core access to multi-resolution data. Technical Report UCRL-JC-140581, Lawrence Livermore National Laboratory, 2001. A preliminary version was presented at the NSF/DOE Lake Tahoe Workshop on Hierarchical Approximation and Geometrical Methods for Scientific Visualization.
[2] V. Pascucci and R. J. Frank. Global static indexing for real-time exploration of very large regular grids. In Proceedings of the 2001 ACM/IEEE Conference on Supercomputing.
[3] B. Summa, G. Scorzelli, M. Jiang, P.-T. Bremer, and V. Pascucci. Interactive editing of massive imagery made simple: Turning Atlanta into Atlantis. ACM Transactions on Graphics, to appear, 2010.
[4] V. Pascucci, D. E. Laney, R. J. Frank, G. Scorzelli, L. Linsen, B. Hamann, and F. Gygi. Real-time monitoring of large scientific simulations. In ACM Symposium on Applied Computing '03, ACM Press, 2003.
[5] W. Gropp and E. Lusk. "User's Guide for MPE: Extensions for MPI Programs". 1998.
[6] G. Bell, T. Hey, and A. Szalay. Computer science: Beyond the data deluge. Science, 323(5919):1297, 2009.
[7] S. Lang, P. Carns, R. Latham, R. Ross, K. Harms, and W. Allcock. I/O performance challenges at leadership scale. In Proceedings of Supercomputing, November 2009.
