ADAM: A Decentralized Parallel Computer Architecture Featuring Fast Thread and Data Migration and a Uniform Hardware Abstraction


ADAM: A Decentralized Parallel Computer Architecture Featuring Fast Thread and Data Migration and a Uniform Hardware Abstraction

by

Andrew “bunnie” Huang

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2002

© Massachusetts Institute of Technology 2002. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 24, 2002

Certified by: Thomas F. Knight, Jr., Senior Research Scientist, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students


ADAM: A Decentralized Parallel Computer Architecture Featuring Fast Thread and Data Migration and a Uniform Hardware Abstraction

by Andrew “bunnie” Huang

Submitted to the Department of Electrical Engineering and Computer Science on May 24, 2002, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Abstract

The furious pace of Moore’s Law is driving computer architecture into a realm where the speed of light is the dominant factor in system latencies. The number of clock cycles required to span a chip is increasing, while the number of bits that can be accessed within a clock cycle is decreasing. Hence, it is becoming more difficult to hide latency. One alternative is to reduce latency by migrating threads and data, but the overhead of existing implementations has made migration an unserviceable solution.

I present an architecture, implementation, and mechanisms that reduce the overhead of migration to the point where migration is a viable supplement to other latency-hiding mechanisms, such as multithreading. The architecture is abstract, and presents programmers with a simple, uniform, fine-grained multithreaded parallel programming model with implicit memory management. In other words, the spatial nature and implementation details (such as the number of processors) of a parallel machine are entirely hidden from the programmer. Compiler writers are encouraged to devise programming languages for the machine that guide a programmer to express their ideas in terms of objects, since objects exhibit an inherent physical locality of data and code. The machine implementation can then leverage this locality to automatically distribute data and threads across the physical machine by using a set of high-performance migration mechanisms.

An implementation of this architecture could migrate a null thread in 66 cycles, over a factor of 1000 improvement over previous work. Performance also scales well; the time required to move a typical thread is only 4 to 5 times that of a null thread. Data migration performance is similar, and scales linearly with data block size. Since the performance of the migration mechanism is on par with that of an L2 cache, the implementation simulated in my work has no data caches and relies instead on multithreading and the migration mechanism to hide and reduce access latencies.

Thesis Supervisor: Thomas F. Knight, Jr.
Title: Senior Research Scientist


Acknowledgments

I would like to thank my parents for all their love and support over the years, and for their unwaxing encouragement and faith in my ability to finish the degree program.

I would also like to thank my wonderful, loving, caring girlfriend Nikki Justis for all her support, motivation, patience, editing, soldering, discussion and idea refinement, cooking, cleaning, laundry doing, driving me to campus, wrist massages, knowing when I need to tool and when I need to take a break, tolerance of my 7 AM sleep schedule, and for letting me make a mess in her room and take over her couch with my whole computer setup.

I would like to thank all my friends for their support over the years, and for making the past decade at MIT, and my first step into the real world, an exciting, fun and rewarding experience. Let the rush begin...and may it never end.

This thesis would never have happened if it were not for the Aries Research Group (in order of seniority): Tom Knight, Norm Margolus, Jeremy Brown, J.P. Grossman, Josie Ammer, Mike Phillips, Peggy Chen, Bobby Woods-Corwin, Ben Vandiver, Tom Cleary, Dominic Rizzo, and Brian Ginsburg. Tom Knight, in particular, has been a role model for me since I came to the lab; he is an endless source of inspiration and knowledge, and has provided invaluable guidance, counsel and encouragement. He is brilliant and visionary, yet humble and very accessible, and always willing to answer my questions, no matter how silly or stupid. I also really enjoy his laissez-faire policies with respect to running the group; I truly treasure the intellectual freedom Tom brought to the group, and his immense faith in all of our abilities to manage and organize ourselves, and to “go forth and think great thoughts.” Jeremy Brown and J.P. Grossman were also invaluable for their good ideas, lively conversation, and idea refinement. Jeremy invented the idempotent network protocol used in this thesis, and his excellent thesis work in novel parallel programming methods and scalable parallel garbage collection fills in many crucial gaps in my thesis. J.P. and Jeremy also developed the capability representation with SQUIDS that is central to my thesis. I also relied on J.P.’s excellent work in researching and characterizing various network topologies and schemes; I used many of his results in my implementation. Bobby Woods-Corwin, Peggy Chen, Brian Ginsburg and Dominic Rizzo were invaluable in working out the implementation of the network. Without them, I would have nothing to show for this thesis except for a pile of Java code. Two generations of M.Eng theses and two UROPs is a lot of work! Norm Margolus also helped lay down the foundations of the architecture with his work in spatial cellular automata machines and embedded DRAM processors.

Finally, André DeHon, although not officially a part of the group, was very much instrumental to my work in many ways. This work relies very heavily upon his earlier work at MIT on the METRO network. André also provided invaluable advice and feedback during his visits from Caltech.

I would also like to give a special thanks to Ben Vandiver. Since the inception of the ADAM and the Q-Machine, Ben has furnished invaluable insight and feedback. The thesis would not be complete without his synergism as a compiler writer, programmer, and high-caliber software hacker. I also thank him for his enthusiasm and faith in the architecture; his positive energy was essential in keeping me from getting discouraged. He not only helped hash out the programming model for the machine, he also wrote two compilers for the machine along the way. He was also instrumental in coding and debugging the benchmarks used in the results section of my thesis.

Krste Asanović and Larry Rudolph were also very important influences on this thesis. Krste is a wellspring of knowledge and has uncannily sharp insight into even the finest architectural details. Larry opened my mind to the field of competitive analysis and on-line algorithms, something I would never have considered otherwise. I also appreciate the critical review provided by both Krste and Larry.

I would also like to thank my friends at the Mobilian Corporation, in particular Rob Gilmore, MaryJo Nettles, Todd Sutton, and Rob Wentworth, for their understanding, patience, and support for me finishing my degree.

I thank the Xilinx Corporation for generously donating the many high-end FPGAs and design tools to the project, which were used to implement prototype network and processor nodes. I would also like to thank the Intel Corporation and Silicon Spice for providing fellowships and equipment that enabled me to finish my work. Sun Corporation and Borland also provided me Java and JBuilder for free; the value of these tools cannot be overstated.

This work was also funded by the Air Force Research Laboratory, agreement number F30602-98-1-0172, “Active Database Technology”.

I could go on, but unfortunately there is not enough space to name everyone who has helped with my thesis. To everyone who provided invaluable input and guidance into my thesis: thank you all. I am indebted to the world, and I can only hope to someday make a contribution that is worthy.

Finally, all mistakes in this thesis are mine.

Contents

1 Introduction 17
  1.1 Contributions 18
  1.2 Organization of This Work 18

2 Background 21
  2.1 Latency Management Techniques 21
    2.1.1 Latency Reduction 21
    2.1.2 Latency Hiding 23
  2.2 Migration Mechanisms 25
    2.2.1 Discussion 27
  2.3 Architectural Pedigree 28
    2.3.1 Dataflow 28
    2.3.2 Decoupled-Access/Execute 30
    2.3.3 Processor-In-Memory (PIM) and Chip Multi-Processors (CMP) 31
    2.3.4 Cache Only Memory Architectures 33

3 Aries Decentralized Abstract Machine 35
  3.1 Introduction to ADAM by Code Example 35
    3.1.1 Basics 35
    3.1.2 Calling Convention 36
    3.1.3 Memory Allocation and Access 38
  3.2 Programming Model 40
    3.2.1 Threads 41
    3.2.2 Queues and Queue Mappings 42
    3.2.3 Memory Model 43
    3.2.4 Interacting with Memory 44

4 Migration Mechanism in a Decentralized Computing Environment 47
  4.1 Introduction 47
  4.2 Background 48
    4.2.1 Architectures that Directly Address Migration 48
    4.2.2 Soft Migration Mechanisms 50
    4.2.3 Programming Environments and On-Line Migration Algorithms 52
  4.3 Migration Mechanism Implementation 55
    4.3.1 Remote Memory Access Mechanism 56
    4.3.2 Migration Mechanism 59
    4.3.3 Data Migration 59
    4.3.4 Thread Migration 62
  4.4 Migration Mechanism Issues and Observations 67
    4.4.1 General Observations 67
    4.4.2 Performance Issues 69

5 Implementation of the ADAM: Hardware and Simulation 71
  5.1 Introduction 72
  5.2 High-Level Organization 72
  5.3 Leaf Node 73
    5.3.1 Processor Node 74
    5.3.2 Memory Node 78
  5.4 Physical Design 80
    5.4.1 Technology Assumptions 80
    5.4.2 Design Description 82

6 Machine and Migration Characterization 87
  6.1 Basic Q-Machine Performance Results 87
    6.1.1 Memory Performance 88
    6.1.2 Basic Network Operations Performance 89
  6.2 Migration Performance and Migration Control: Simple Cases 90
    6.2.1 Two Threads Benchmark 90
    6.2.2 Thread and Memory Benchmark 96
  6.3 Application Cases 100
    6.3.1 In-Place Quicksort Application 100
    6.3.2 Matrix Multiplication Benchmark 104
    6.3.3 N-Body Benchmark 107

7 Conclusions and Future Work 113
  7.1 Conclusions 113
  7.2 Future Work 114
    7.2.1 Improved Migration Control Algorithms 114
    7.2.2 Languages and Compilers 115
    7.2.3 Hardware Implementation 116
    7.2.4 Transactions 117
  7.3 Final Remarks 117

A Acronyms 119

B ADAM Details 123
  B.1 Data Types 123
  B.2 Instruction Formats 126
  B.3 Capability Format 129
  B.4 Über-Capability and Multitasking 132
  B.5 Exception Handling 132

C Q-Machine Details 135
  C.1 Queue File Implementation Details 135
    C.1.1 Physical Design 136
    C.1.2 State Machine 139
  C.2 Network Interface 143
  C.3 Network Topology and Implementation 148

D Opcodes 151
  D.1 General Notes 151
  D.2 Lazy Instructions 151
  D.3 Instruction Summary 152

List of Figures

1-1 Overview of the abstraction layers in this thesis. Couatl and People are compilers written by Ben Vandiver. 20
2-1 Reachable chip area in top-level metal, where area is measured in six-transistor SRAM cells. Directly from [AHKB00]. 24
2-2 Illustration of the false sharing problem. 28
3-1 Demonstration of the copy/clobber (@) modifier. 36
3-2 Simple code example demonstrating procedure linkage, thread spawning, memory allocation, and memory access. 37
3-3 Thread states after thread spawn and procedure linkage. 39
3-4 Thread states after memory allocation and access. 40
3-5 Programming model of ADAM. 41
3-6 Structure of an ADAM thread. 42
3-7 High-level breakdown of the ADAM capability format. Detailed bit-level breakdowns of each field can be found in Appendix B. 44
4-1 Format of a remote memory capability's shadow space in local virtual memory space. 57
4-2 System-level view of resolving remote memory requests. 58
4-3 Details of handling remote and local EXCH requests. 58
4-4 Mechanism for temporarily freezing memory requests. 61
4-5 Handling of a migrated EXCH request with temporally bi-directional pointers. 63
4-6 Transmission line protocol for handling forwarding pointer updates on thread-mapped communications. 68
4-7 Overview of a demand-driven data propagation scheme. 70
5-1 Pieces of a Q-Machine implementation. Node ID tags are uniform across the machine, so network-attached custom hardware is addressable like any processor or memory node. 73
5-2 High-level block diagram of a leaf node. 74
5-3 Detail of a processor node. 75
5-4 Hybrid scheduler list/I-cache structure. In this diagram, c42 and c10 are runnable and up for forwarding to the work-queue; as values for c55:q12 and c4:q4 arrive via the NI, they will be promoted to runnable status. 78
5-5 High-level block diagram of a memory node. 79
5-6 Packaging and integration for a two-layer silicon high-performance chip multiprocessor. 82
5-7 Cartoon of the network layer layout. 83
5-8 Hypothetical layout of a single processor node. 84
5-9 Hypothetical layout of the tile processor chip. 85
6-1 Screenshot of the ASS running a 64-node vector reverse regression test. On the left is the machine overview; to the right is the thread debugger window. 88
6-2 The two threads synthetic benchmark. Communication happens along the arcs; a data dependency is forced by printing the incoming data. 91
6-3 Code used for the two thread benchmark. 92
6-4 Measured speedup versus migration distance for the Two Threads benchmark. 93
6-5 Shape of the curve x · c^x. 95
6-6 Length of message sequence required to amortize various migration overheads (M(d)). The baseline two messages per iteration for the Two Thread benchmark is also marked on the graph. 96
6-7 The thread and memory synthetic benchmark. Communication happens along the arcs; a data dependency is forced by printing the incoming data. 97
6-8 Code used for the thread-memory benchmark. 98
6-9 Migration speedup versus migration decision time and memory capability size in the thread and memory benchmark. 99
6-10 Cycles per iteration for Thread-Memory benchmark. d = 4 in both cases. 99
6-11 Object method for the Quicksort benchmark written in People. 101
6-12 Distribution of migration times used in the Quicksort benchmark. 102
6-13 Plot of the load metric Tw versus time for the Quicksort benchmark with and without load balancing. 103
6-14 Plot of the load-balanced Quicksort benchmark with migration events overlaid. 105
6-15 Portion of the streaming matrix multiply benchmark written in People. 106
6-16 Plot of the time required per iteration of a 100x100 matrix multiply over various migration conditions and coding styles. 108
6-17 Plot of the time required per iteration of a 15x15 matrix multiply over various migration conditions and coding styles. 108
6-18 Plot of the first few time steps of the N-Body benchmark output. 109
6-19 Inner loop of N-Body benchmark code. 111
6-20 Plot of the time required per timestep of a 12-body N-body simulation run on a 64-node Q-Machine. 111
B-1 Data formats supported by ADAM. 124
B-2 Tag and Flag field details. 125
B-3 Format of ADAM opcodes. 127
B-4 ADAM capability format. 129
B-5 Exception handling overview. 133
C-1 A 3-write, 3-read port VQF implementation. pq = log2(#physical registers). Q-cache details omitted for clarity. 136
C-2 PQF unit cell. 137
C-3 PQF read request response flowchart. 140
C-4 PQF write request response flowchart. 142
C-5 Details of the network interface. 144
C-6 Idempotence and reliable data delivery protocol in detail for a single transaction. Lines in gray are "retry" lines that would not happen in an ideal setting. 146
C-7 Details of packet formats. Note that in the destination/source cID and queue headers, it is very important that the processor ID be in the MSB and co-located with the address field, since implementations may push bits between the address and PID fields to increase the number of routable processor nodes or to increase the amount of memory per node. 147
D-1 qb format for the PARCEL instruction. 240

List of Tables

5.1 Extrapolated Technology Parameters for 2010. All values from [CI00a] unless otherwise noted. 81
A.1 Table of Acronyms 120
A.2 Table of Acronyms, continued 121


Chapter 1

Introduction

You can't fake memory bandwidth that isn't there.
—Seymour Cray, on why the Cray-1 had no caches

Most data and thread migration mechanisms to date are slow when compared to other latency management techniques. This thesis introduces an architecture, ADAM, that enables a simple hardware implementation of data and thread migration. This implementation reduces the overhead of migration to the point where it is comparable to other hardware-assisted latency management techniques, such as caching.

Data migration is useful for reducing access latencies in situations where the working set is larger than the cache. It is also useful for reducing or redistributing network traffic in situations where hotspots are caused by contention for multiple data objects. Data migration can also be used to emulate the function of caches in systems that feature no data caches.

Thread migration is useful for reducing access latencies in situations where multiple threads are contending for a single piece of data. Like data migration, it is also useful in situations where hotspots can be alleviated by redistributing the sources and destinations of network traffic. Thread migration is also useful for load balancing, particularly in situations where memory contention is low.

Data and thread migration can be used together to help manage access latencies in situations where many threads share information in an unpredictable fashion among many pieces of data, as might be the case in an enterprise database application. Data and thread migration can also be used to enhance system reliability, if faults can be predicted far enough in advance that the failing node can be flushed of its contents.
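The use cases above suggest a division of labor for a migration controller: move data toward a thread that dominates its use, and move contending threads toward the data they share. As a purely illustrative sketch, not taken from the thesis, with all names and thresholds invented, such a decision rule might look like:

```python
# Toy migration-decision heuristic (illustrative only; names and the
# contention threshold are invented, not part of the ADAM design).
from collections import Counter

def choose_migration(accesses, contention_threshold=3):
    """accesses: list of thread IDs that recently touched a data object.

    Returns ('migrate_data', thread) when a single thread is the sole
    user (move the data to that thread), ('migrate_threads', None) when
    many threads contend (co-locate the threads with the data), or
    (None, None) when no migration is clearly worthwhile.
    """
    counts = Counter(accesses)
    if not counts:
        return (None, None)
    top_thread, _ = counts.most_common(1)[0]
    if len(counts) == 1:
        # Sole user: bring the data to its only consumer.
        return ("migrate_data", top_thread)
    if len(counts) >= contention_threshold:
        # Many contenders: cheaper to move the threads to the data.
        return ("migrate_threads", None)
    return (None, None)
```

A real controller would also weigh migration cost against the expected latency savings; the thesis surveys on-line migration control algorithms of this kind in Chapter 4.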

1.1 Contributions

The primary contribution of my thesis is a fast, low-overhead data and thread migration mechanism. In terms of processor cycles, the mechanism outlined in my thesis represents greater than a 1000-fold increase in performance over previous software-based migration mechanisms. As a result, data and thread migration overheads are similar to L2 cache fills on a conventional uniprocessor system.

The key architectural features that enable my data and thread migration mechanisms are a unified thread and data representation using capabilities, and interthread communication and memory access through architecturally explicit queues. Threads and data in my architecture, ADAM, are accessed using a capability representation with tags that encode base and bounds information. In other words, every pointer has associated with it the region of data it can access, and this information trivializes figuring out what to move during migration. Architecturally explicit queues, on the other hand, simplify many of the ancillary tasks associated with migrating threads and data, such as the movement of stacks, the migration and placement of communication structures, concurrent access to migrating structures, and pointer updates after migration.

My thesis also describes an implementation outline of ADAM dubbed the "Q-Machine". The implementation technology is presumed to be 35 nm CMOS silicon, available in volume around 2010; the implementation features no data caches, relying instead on the migration mechanism and multithreading to maintain good performance and high processor utilization. The proposed implementation is simulated with the ADAM System Simulator (ASS); it is this simulator that provides the results upon which the ADAM architecture is evaluated.
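To see why base-and-bounds capabilities trivialize deciding what to move, consider the following toy model. It is an illustrative sketch only, not the thesis's actual mechanism: the names are invented and a Python dict stands in for a node's memory. Because the capability names the exact region [base, base + bounds), migration is just a bounded copy plus a forwarding pointer at the old home:

```python
# Toy model of capability-based data migration (invented names; a dict
# stands in for each node's memory store).

class Capability:
    def __init__(self, node, base, bounds):
        self.node = node      # home node ID
        self.base = base      # start address of the object
        self.bounds = bounds  # object length in words

def migrate_data(cap, memories, dest_node):
    """Move the object named by cap from its home node to dest_node."""
    src = memories[cap.node]
    dst = memories[dest_node]
    # The capability's tags say exactly which words belong to the object.
    for addr in range(cap.base, cap.base + cap.bounds):
        dst[addr] = src.pop(addr)
    # Leave a forwarding pointer so stale references can be redirected.
    src[cap.base] = ("forward", dest_node)
    cap.node = dest_node
    return cap
```

Without bounds tags, the mover would have to consult external metadata (or conservatively copy too much) to find the object's extent; with them, the extent travels with every pointer.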
Note that there is no requirement for advanced technology to implement the ADAM; one could make an ADAM implementation today, if so desired. The 2010 technology point was chosen to evaluate the ADAM architecture because it would match a likely tape-out time frame of the architecture's implementation.

1.2 Organization of This Work

Chapter 2, "Background", discusses some of the advantages and disadvantages of a migration scheme over more conventional latency management schemes. It also reviews, at a high level, some of the problems encountered in previous migration schemes; a more detailed review of migration mechanisms is presented in Chapter 4. Chapter 2 closes with a differentiation of this work from its predecessors in a brief discussion of the architectural pedigree of the ADAM and its Q-Machine implementation.

Chapter 3, "Aries Decentralized Abstract Machine", describes the ADAM in detail. This chapter lays the foundation for the programming model of the ADAM through a simple code example, followed by a discussion of the architectural details relevant to a migration implementation. A detailed discussion of other architectural features can be found in Appendix B.

Chapter 4, "Migration Mechanism in a Decentralized Computing Environment", presents the implementation of the migration mechanisms. The chapter begins with a survey of previous work involving data and thread migration; this survey includes both mechanisms and migration control algorithms, since their implementation details are intimately associated. I then describe the migration mechanism in detail.

Chapter 5, "Implementation of the ADAM: Hardware and Simulation", describes an implementation of ADAM, known as the Q-Machine. This chapter summarizes the machine organization and implementation technology assumptions of the simulator used to evaluate my migration mechanisms.

In the next chapter, "Machine and Migration Characterization" (Chapter 6), I characterize the performance of the implementation. The chapter starts with two simple micro-kernel benchmarks and some formal analysis of the migration mechanism. Then I present results for some more comprehensive benchmarks, Quicksort, Matrix Multiply, and N-Body, with simple migration control heuristics driving the migration mechanisms.

The thesis concludes in Chapter 7 with a discussion of further developments for the ADAM architecture, areas for improvement and further research, and programming languages for the machine. Note that while a detailed discussion of programming languages for the ADAM is outside the scope of this thesis, I did not work in a programming language vacuum. A strong point of using an abstract machine model is that compiler writers can begin their work on day one, and in fact, that is the case. Benjamin Vandiver, an M.Eng student in my research group, has developed two languages, Couatl and People, and compilers for these languages to the ADAM architecture. Couatl is a basic object-oriented language that we used in the early stages of architecture development to hammer out the abstract machine model and to determine the unique strengths and weaknesses of a queue-based architecture. The follow-on language, People, is a more sophisticated language supporting str

