Improving I/O Performance Through The Dynamic

2y ago
47 Views
2 Downloads
296.16 KB
7 Pages
Last View : 8d ago
Last Download : 3m ago
Upload by : Camille Dion
Transcription

IEEE International Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications21-23 September 2009, Rende (Cosenza), ItalyImproving I/O Performance through the Dynamic Remapping ofObject SetsJeremy Logan1, Phillip Dickens212University of Maine, Orono, Maine, USA, jeremy.logan@maine.eduUniversity of Maine, Orono, Maine, USA, dickens@umcs.maine.eduAbstract - Our research has been investigating a newapproach to parallel I/O based on what we term objects. Thepremise of this research is that the primary obstacle toscalable I/O is the legacy view of a file as a linear sequence ofbytes. The problem is that applications rarely access theirdata in a way that conforms to this data model, using insteadwhat may be termed an object model, where each processaccesses a (perhaps disjoint) collection of objects. We havedeveloped an object-based caching system that provides aninterface between MPI applications and a more powerfulobject file model, and have demonstrated significantperformance gains based on this new approach. In thispaper, we further explore the advantages that can be gainedfrom using object-based I/O. In particular, we demonstratethat parallel I/O based on objects (termed parallel objectI/O) can be dynamically remapped. That is, one applicationcan output an object stream based on one object set, this canbe captured and translated into a different object set that ismore appropriate for another application. We demonstratehow such remapping can be accomplished, and provide anexample application showing that using this technique cansignificantly improve I/O performance.Keywords - Parallel I/O; High Performance Computing;Data-intensive applications; MPI-IO.I.INTRODUCTIONLarge-scale computing clusters are increasingly beingused to execute large-scale, data-intensive applications inseveral disciplines including, for example, high-resolutionsimulation of natural phenomenon, large-scale climatemodeling, earthquake modeling, visualization/animationof scientific data, and distributed collaboration. Theexecution of such applications is supported by state-ofthe-art file systems (e.g., Lustre [2], GPFS [18]) thatprovide tremendous aggregate storage capacity, and byparallel I/O interfaces that can interact with such filesystems to optimize access to the underlying store. Themost widely used parallel I/O interface is MPI-IO [4],which provides to the application a rich API that can beused to express complex I/O access patterns, and whichprovides to the underlying implementation manyopportunities for important I/O optimizations. Theproblem, however, is that even with all of this hardwareand software support, the I/O requirements of dataintensive applications are still straining the I/O capabilitiesof even the largest, most powerful file systems in usetoday. Thus new approaches are needed to support theexecution of current and next-generation data-intensiveapplications.There are many factors that make this problem,generally termed the scalable I/O problem, so challenging.The most often cited difficulties include the I/O accesspatterns exhibited by scientific applications (e.g., noncontiguous I/O [6, 7, 11]), poor file system support forparallel I/O optimizations [15, 16], strict file consistencysemantics [12], and the latency of accessing I/O devicesacross a network. However, we believe that a morefundamental problem, whose solution would help alleviateall of these challenges, is the legacy view of a file as alinear sequence of bytes. The problem is that applicationprocesses rarely access data in a way that matches this filemodel, and a large component of the scalability problem isthe cost of translating between the process data model andthe file data model. In fact, the data model used byapplications is more accurately defined as an objectmodel, where each process maintains a collection of(perhaps) unrelated objects. We believe that aligning thesedifferent data models will significantly enhance theperformance of parallel I/O for large-scale, data-intensiveapplications.This research is attacking the scalable I/O problem bydeveloping the infrastructure to merge the power andflexibility of the MPI-IO parallel I/O interface with a morepowerful object-based file model. Toward this end, wehave developed an object-based caching system thatserves as an interface between MPI applications andobject-based files. The object-based cache is based onMPI file views [3], or, more precisely, the intersections ofsuch views. These intersections, which we term objects,identify all of the file regions within which conflictingaccesses are possible and (by extension) those regions forwhich there can be no conflicts (termed shared-objectsand private-objects respectively). This information can beused by the runtime system to significantly increase theparallelism of file accesses and decrease the cost ofenforcing strict file consistency semantics and globalcache coherence.Previous research has shown that using the object-basedcaching system can lead to a significant increase inperformance compared to native MPI-IO [13] for theFLASH-IO parallel I/O benchmark [1]. However, we didnot at that time fully support the object file model. Inparticular, an object file created by one application couldonly be re-opened by another application with the sameobject set.

This issue can be thought of within the context of aproducer/consumer problem. One application produces aset of objects (e.g., creates an object file) that anotherapplication requires (the consumer). However, theconsumer requires a different object set. For example, along running application may checkpoint its stateinformation as a set of objects, terminate unexpectedly,and subsequently be restarted with a different number ofprocesses. Another example would be when an applicationchanges the file views upon which the current object set isbased. More generally, an application’s object set reflectsthe current file access patterns, and when access patternschange, new objects must be created that reflect suchchange. We refer to this as the dynamic remappingproblem.In this paper, we describe our approach to the dynamicremapping problem. It is based on the construction andutilization of interval trees, which store information aboutthe current object set. Logically, what we refer to as atranslator is placed between the producer and consumerapplications, which utilizes the information stored in aninterval tree to perform this translation.During the course of this research it has becomeapparent that the ability to perform the dynamicremapping of object sets can provide the foundation forother important capabilities. For example, one emergingcharacteristic of next-generation scientific applications istheir ability to adapt their behavior in response to changesin resource availability [10]. While there has beensignificant research investigating the remapping of thecomputational components of an application [14, 17], theissue of remapping its I/O component has been largelyignored. Dynamic remapping can also improve theperformance characteristics of applications even when itwould otherwise not be necessary to perform suchremapping. This could occur, for example, when aproducing application remaps the object set as it writes itto an object file such that it optimizes the performance ofthe consuming application.The primary contribution of this paper is thedevelopment of a technique to perform dynamicremapping of parallel I/O and a demonstration of theperformance enhancement that is available using thistechnique.II.BACKGROUNDMPI-IO is the IO component of the MPI standard [4]that was designed to provide MPI applications withportable, high performance parallel I/O. It provides a richand flexible API that provides to an application the abilityto express complex parallel I/O access patterns in a singleI/O request, and provides to the underlyingimplementation important opportunities to optimize accessto the underlying file system. It is generally agreed thatthe most widely used implementation of the MPI-IOstandard is ROMIO [20-22, 24], which was developed atArgonne National Laboratory and is included in theMPICH2 [5] distribution of the MPI standard. ROMIOprovides key optimizations for enhanced performance(e.g., two-phase I/O [9, 21]and data sieving[21-23]), andis implemented on a wide range of parallel architecturesand file systems. The portability of ROMIO stems from aninternal layer termed ADIO [22] (an Abstract DeviceInterface for parallel I/O) upon which ROMIOimplements the MPI-IO interface. ADIO implements thefile system dependent features, and is thus implementedseparately for each file system.A. MPI File ViewsAn important feature of MPI-IO is the file view [3],which maps the relationship between the regions of a filethat a process will access and the way those regions arelaid out on disk. A process cannot “see” or access any fileregions that are not in its file view, and the file view thusessentially maps a contiguous window onto the (perhaps)non-contiguous file regions in which the process willoperate. If its data is stored on disk as it is defined in thefile view, only a single I/O operation is required to movethe data to and from the disk. However, if the data isstored non-contiguously on disk, multiple I/O operationsare required.Contiguous “view window” in memoryView WindowFileP0 P1 P0 P1 P0 P1 P0 P1012345677Non-contiguous File Regions on DiskFigure 1: The view windowFigure 1 depicts a file region in which two processesare operating, and the data for each is laid out noncontiguously on disk. The file view for Process P0 isshown, which creates a contiguous “view window” of thefour data blocks it will access. Thus, the data model thatP0 is using is a contiguous file region, which conflictswith the file data model. Because of these conflictingviews, it will require four separate I/O operations toread/write its data from/to the disk. If it were stored ondisk as it is used by P0, such data accesses would require a

single I/O operation.File views contain valuable information about fileaccess patterns, and, when aggregated, show exactly thosefile regions in which contention is possible (there isoverlap between file views), and, by extension, thoseregions for which contention is not possible.B. ObjectsObjects represent non-overlapping file regions that canbe either private to a process (i.e., no other process willoperate in that particular region), or shared by a set ofprocesses. This distinction between shared objects andprivate objects has two very important ramifications: First,only shared objects must be locked, and such objectsrepresent the minimum overlap of shared data. Thiscompletely eliminates false sharing and provides themaximum possible concurrency for data access. Second,such information can simplify the locking mechanism andsignificantly increase its performance. This is becauseeach object manager knows exactly which processes canaccess its objects, and acts as a centralized lock managerfor those processes. Thus contention for write locks,which can significantly reduce performance, is limited tothe subset of processes that can access the objects beingcontrolled by a given manager. In essence, this creates aset of centralized lock managers that are operating inparallel.C. Object Based Caching SystemThe object-based caching system functions in the samemanner as traditional file system caches except that it isobjects rather than disk blocks that are cached. It takes anI/O request from an application process that is expressedin terms of a linear sequence of bytes and converts it to anequivalent request expressed in terms of objects. It thencarries out the request either on its own (the object isprivate) or in collaboration with other object managers(the object is shared).All of the processes that share a given file participate inthe object cache for that file. The cache buffer consists ofmemory from the participating processes plus anyavailable local disk space. The cache objects are createdwhen a shared file is opened, and the objects and cache forthat file are torn down when the file is closed. There is alocal object manager for each process participating in thecache. Once the objects are created, they are distributedamong the managers based on a cost model of assigning agiven object to a given process. The local managercontrols the meta-data for its objects and performs anyobject locking necessary to maintain global cachecoherence. Once the objects are created, all subsequentI/O operations are carried out in the cache (except in thecase of a sync() or close operation).Metadata and locking responsibilities for a given objectare maintained by (exactly) one of the object managerssharing the object. When a process wants to write into ashared object, the request is trapped by its local managerand forwarded to the appropriate lock manager. Once thelock has been acquired, the object can be written into therequesting processes’ local cache, or the object can bemodified and the updated object can be sent to the objectmanager that owns the object. In the first case, the write isperformed and the writing process becomes the newmanager for that object. In the latter case, ownership ofthe object is not changed. We are currently investigatingthe trade-offs associated with each approach.III.DYNAMIC REMAPPING OF OBJECT BASED I/OGiven an understanding of the basic system componentswe now focus on the dynamic remapping of object-basedI/O. We first show how objects are created, followed by adiscussion of how they are represented at runtime viainterval trees. We then provide an example demonstratinghow dynamic remapping can be performed.D. Object CreationThink of a file as represented by an integer line thatextends from 0 to n – 1, where n is the number of bytes inthe file. Given this representation of a file, a file view canbe thought of as a set of intervals on this integer line,where each interval represents the endpoints of a fileregion in which the owning process will operate. Theseendpoints are obtained from the file views, and divide theinteger line into a set of partitions termed elementaryintervals [19]. Each file view can contain multipleintervals, and as more intervals are placed on the integerline, more elementary intervals are created. Once all of theintervals (of all file views) have been added to the line,each of the resulting elementary intervals corresponds toan object.Figure 2 depicts object creation using this technique. Itshows the file views of three processes, an integer linerepresenting an 80-unit file, and how the endpointsassociated with the file views cut the line. For example,the first interval associated with the file view of processP0 partitions the line into two segments, 0 -14 and 15 –69. When the first interval associated with process P1 isadded to the line, it creates three new partitions: 10 – 14,15 – 25, and 26 – 29. The other partitions are similarlycreated, and an object is created for each resultingelementary interval.As can be seen, each shared object encompasses exactlythe overlapping file region within which contention canoccur. It is also evident that objects are defined such thatthey cannot overlap.

F. Interval Tree Example0 – 1460 - 7910 – 2950 - 6525 - 55010 1425 295055 606579Figure 3 shows the interval tree that would be constructedto store the objects created in Figure 2. The tree nodes arecreated as described above, splitting the difference at eachsuccessive level of the tree. For example, the root of thetree divides the range of the file (0 – 79) in half, with themidpoint 39 being stored in the node. The root’s left childdivides the range between the beginning of the file and theposition of the root node (0 – 39) in half, thus it hasposition 19.Position: 39Objects: O4Procs: 2Figure 2: Object Creation. As intervals are added to the integerline new elementary elements are created.E. Interval Search TreesThe advantage of creating objects in this manner is thatit provides the basis for building an efficient search treethat can store and retrieve information about the objectset. Because objects defined in this manner cannotoverlap, it is possible to build a simple binary search tree,organized by object intervals, which can provide theneeded functionality. Assuming that the tree is keptbalanced, this approach would provide O(log n r)lookup time, where n is the number of intervals (objects)stored in the tree, and r is the number of objects in thesearch result [8].We needed slightly more functionality than thatprovided by a simple binary tree, and use instead aninterval tree that stores the position on the integer line ateach node, along with all objects containing the interval.Assuming the interval tree is kept balanced, it would havethe same efficiency characteristics as those of a simplebinary tree.One of the most important issues with respect to binarytrees is the method used to keep it balanced. A simplestrategy for balancing is to choose a position for the rootnode that splits the integer line in half and recursivelysplits in half all of the sub-trees that are added. This wouldcontinue until all of the elementary intervals had beenadded. This simple strategy would result in a fairlybalanced tree in cases where the file objects are evenlydistributed, but has the potential for O(n) searchperformance in the worst case. There are many otheroptions that can also be considered, ranging fromprobabilistic techniques such as random ordering ofinsertions, to the use of a self balancing tree such as a redblack tree or an AVL tree [8]. We are currentlyinvestigating the performance of these and othertechniques.Position: 19Objects: O2Procs: 1Position: 9Objects: O0Procs: 0Position: 14Objects: O1Procs: 0,1Position: 59Objects: O6Procs: 1Position: 29Objects: O3Procs: 1,2Position: 49Objects: Procs: -Position: 54Objects: O5Procs: 1,2Position: 69Objects: O8Procs: 0Position: 64Objects: O7Procs: 0,1Figure 3: Interval Tree ExampleNow consider how the interval tree would be searchedto find all of the objects that contain the interval {18,35}.Starting at the root, the search interval is compared withthe node’s position. There are three possible outcomesfrom this comparison: the search range is entirely to theleft of the node’s position, the search range is entirely tothe right of the node’s position, or the search rangecontains the node’s position. Since the root node’sposition is entirely to the right of the search interval, theright sub-tree may be safely pruned from the search as itcannot contain any objects in the search range. Anyobjects stored at the root node (in this case, only O4) mustbe checked, as they may or may not intersect the searcharea. Since O4 is found to overlap the search interval it isadded to the result, and the left sub-tree is searchedrecursively.The search of the root’s left sub-tree starts at the nodewith position 19. Since position 19 falls within the searchinterval, the object at that node, O2, must overlap thesearch interval (they have at least byte 19 in common), soit may be immediately added to the result. Since thenode’s position falls within the search interval, both leftand right sub-trees must be searched recursively. Thesearch moves to the node with position 9, which falls tothe left of the search interval. O0 is checked and ignored,as it does not fall within the search interval. The node withposition 14 is searched recursively, and O1 is also ignored.Finally, the node with position 29 is checked. Since 29falls within the search interval, O3 is immediately added to

the result, completing the search.G. Using Interval Trees for Dynamic RemappingLogically, the dynamic remapping of object streams canbe thought of as consisting of three components: aproducer of objects, a consumer with a different objectstructure on the same file, and a translator that sitsbetween the two. The translator is a distributed applicationtha

Improving I/O Performance through the Dynamic Remapping of Object Sets Jeremy Logan 1, Phillip Dickens 2 1 University of Maine, Orono, Maine, USA, jeremy.logan@maine.edu 2 University of Maine, Orono, Maine, USA, dickens@umcs.maine.edu Abstract - Our research has been investigating a ne

Related Documents:

Texts of Wow Rosh Hashana II 5780 - Congregation Shearith Israel, Atlanta Georgia Wow ׳ג ׳א:׳א תישארב (א) ׃ץרֶָֽאָּהָּ תאֵֵ֥וְּ םִימִַׁ֖שַָּה תאֵֵ֥ םיקִִ֑לֹאֱ ארָָּ֣ Îָּ תישִִׁ֖ארֵ Îְּ(ב) חַורְָּ֣ו ם

in-depth understanding of business objectives, ABB helps to refine operational processes and create sustained performance improvement by: on project Exploiting state-of-the-art technology Improving business and supply chain performance Improving variable cost performance Optimizing the asset base Improving the contribution of operations to business

Improving Global Health Care Delivery Through Collaboration & Partnership: Grand Rounds May 2014 - UT Southwestern, Dallas, Texas Author: UT Southwestern Subject: Michelle Niescierenko, M.D. May 2014 Grand Rounds Presentation - Improving Global Health Care Delivery Through Collaboration & Partnership Created Date: 5/12/2014 2:48:55 PM

Improving pupils’ dining experience will have a positive impact on the uptake of school meals and the wider school day. The main benefit your school can expect from improving the dining experience for pupils is a happier and calmer population of children and young people. Improving the dining experience at lunch time will also:

Improving Safety in Women’s Facilities A contextual approach to improving security in women’s facilities Owen, Wells, Pollock, Muscat & Torres (NIJ Award # 2006-RP-BX-0016) Translating Research into Practice: Improving Safety in Women’s Facilities (McNabb) is a bulletin from the report Gendered Violence and Safety: A contextual approach to

Figure 10. Centrifugal Pump Performance Curves 37 Figure 11. Family of Pump Performance Curves 38 Figure 12. Performance Curves for Different Impeller Sizes 38 Figure 13. Performance Curves for a 4x1.5-6 Pump Used for Water Service 39 Figure 14. Multiple Pump Operation 44 Figure 15. Multiple-Speed Pump Performance Curves 45 Figure 16.

Update performance goals in the performance goal plan for the current performance period being evaluated in Perform2Achieve. New users will need to enter their performance goals. 2. Complete Annual Performance Appraisal. Enter Performance Goals for Next Year. 3. Complete the annual performance appraisal process using Perform2Achieve.

2 Page . Preface . The Academic Phrasebank is a general resource for academic writers. It aims to provide the phraseological ‘nuts and bolts’ of academic writing organised a