Chapter 5 Multiprocessors And Thread-Level Parallelism


Computer Architecture: A Quantitative Approach, Fifth Edition
Chapter 5: Multiprocessors and Thread-Level Parallelism
Copyright 2012, Elsevier Inc. All rights reserved.

Contents
1. Introduction
2. Centralized SMA – shared-memory architecture
3. Performance of SMA
4. DSM – distributed shared-memory architecture
5. Synchronization
6. Models of consistency

1. Introduction
Why multiprocessors?
- Need for more computing power: data-intensive applications; utility computing requires powerful processors.
- Increasing clock rate has limited ability; architectural approaches (ILP, CPI) are increasingly more difficult.
- Multi-processor and multi-core systems are more feasible with current technologies.
- Advantage of multiprocessors and multi-core: replication rather than unique design.

Multiprocessor types
- Symmetric multiprocessors (SMP): share a single memory with uniform memory access/latency (UMA); small number of cores.
- Distributed shared memory (DSM): memory distributed among processors; non-uniform memory access/latency (NUMA); processors connected via direct (switched) and non-direct (multi-hop) interconnection networks.

Important ideas
- Technology drives the solutions. Computing and communication are deeply intertwined. Multi-cores have altered the game!!
- Thread-level parallelism (TLP) vs ILP.
- Write serialization exploits broadcast communication on the interconnection network, or on the bus connecting the L1, L2, and L3 caches, for cache coherence.
- Access to data located at the fastest memory level greatly improves performance.
- Caches are critical for performance but create new problems. Cache coherence protocols:
  1. Cache snooping – traditional multiprocessors.
  2. Directory-based – multi-core processors.

Review of basic concepts
- Cache: a smaller, faster memory which stores copies of the data from frequently used main memory locations. Caches are organized in blocks or cache lines.
- Cache writing policies:
  - Write-through: every write to the cache causes a write to main memory.
  - Write-back: writes are not immediately mirrored to main memory. Locations written are marked dirty and written back to main memory only when that data is evicted from the cache. A read miss may therefore require two memory accesses: write the dirty location to memory and read the new location from memory.
- Cache blocks consist of: a tag, containing (part of) the address of the actual data fetched from main memory; the data block; and flags (dirty bit, shared bit, ...).
- Broadcast networks: all nodes share a communication medium and hear all messages transmitted, e.g., a bus.
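As a concrete picture of this bookkeeping, here is a minimal C sketch of the metadata carried by one cache line; the 64-byte block size and the field layout are illustrative assumptions, not taken from the slides.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of one cache line's bookkeeping (64-byte blocks assumed). */
typedef struct {
    uint64_t tag;       /* (part of) the main-memory address of the data  */
    uint8_t  data[64];  /* the cached block itself                        */
    bool     valid;     /* line holds meaningful data                     */
    bool     dirty;     /* write-back only: modified, not yet in memory   */
    bool     shared;    /* another cache may also hold this block         */
} cache_line_t;
```

Under a write-through policy the dirty bit is unnecessary, since main memory always holds the current value.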

Cache coherence and consistency (Centralized Shared-Memory Architectures)
Coherence:
- Reads by any processor must return the most recently written value.
- Writes to the same location by any two processors are seen in the same order by all processors.
Consistency:
- A read returns the last value written.
- If a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A.

Thread-level parallelism (TLP)
- Distribute the workload among a set of concurrently running threads.
- Uses the MIMD model: multiple program counters.
- Targeted for tightly-coupled shared-memory multiprocessors.
- To be effective, need n threads for n processors.
- Grain size: the amount of computation assigned to each thread.
- Threads can be used for data-level parallelism, but the overheads may outweigh the benefit.
Speedup:
- The maximum speedup with n processors is n: embarrassingly parallel.
- The actual speedup depends on the ratio of the parallel versus sequential portions of a program, according to Amdahl's law.

TLP and ILP
- The costs of exploiting ILP are prohibitive in terms of silicon area and power consumption.
- Multicore processors have altered the game: they shifted the burden of keeping the processor busy from the hardware and the architects to application developers and programmers. Shift from ILP to TLP.
- Large-scale multiprocessors are not a large market; they have been replaced by clusters of multicore systems.
- Given the slow progress in parallel software development over the past 30 years, it is reasonable to assume that TLP will continue to be very challenging.

Multi-core processors
- Cores are now the building blocks of chips.
- Intel offers a family of processors based on the Nehalem architecture with different numbers of cores and L3 cache sizes.

TLP – exploiting parallelism
- Speedup = execution time with one thread / execution time with N threads.
- Amdahl's insight: speedup depends on the ratio of parallel to sequential execution blocks. It is not sufficient to reduce the parallel execution time, e.g., by increasing the number of threads; it is critical to reduce the sequential execution time!!
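The slides invoke Amdahl's law without writing it out; in the usual formulation, used in the problem that follows, with parallel fraction f_par and n processors:

```latex
\text{Speedup} = \frac{1}{(1 - f_{\mathrm{par}}) + \dfrac{f_{\mathrm{par}}}{n}}
```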

Problem
What fraction of a computation can be sequential if we wish to achieve a speedup of 80 with 100 processors?

Solution
According to Amdahl's law,
  80 = 1 / ((1 − f_par) + f_par/100).
Solving, 0.99 · f_par = 1 − 1/80 = 0.9875, so the parallel fraction is f_par ≈ 0.9975. This implies that only 0.25% of the computation can be sequential!!!

DSM
Pros:
1. Cost-effective way to scale memory bandwidth if most accesses are to local memory.
2. Reduced latency of local memory accesses.
Cons:
1. Communicating data between processors is more complex.
2. Software must change to take advantage of the increased memory bandwidth.

Slowdown due to remote access
A multiprocessor has a 3.3 GHz clock (0.3 ns cycle) and a CPI of 0.5 when references are satisfied by the local cache. A processor stalls for a remote access, which requires 200 ns. How much faster is an application that uses only local references versus one in which 0.2% of the references are remote?
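The deck does not include the worked answer; with the numbers as stated, the calculation goes as follows:

```latex
\text{Remote cost} = \frac{200\,\mathrm{ns}}{0.3\,\mathrm{ns/cycle}} \approx 667\ \text{cycles},
\qquad
\mathrm{CPI}_{0.2\%\ \text{remote}} = 0.5 + 0.002 \times 667 \approx 1.83
```

so the application with only local references is about 1.83 / 0.5 ≈ 3.7 times faster.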

2. Data access in SMA
- Caching data reduces the access time but demands cache coherence.
- Two distinct data states:
  - Global state: defined by the data in main memory.
  - Local state: defined by the data in the local caches.
- In multi-core, the L3 cache is shared; the L1 and L2 caches are private.
- Cache coherence defines the behavior of reads and writes to the same memory location. The value that should be returned by a read is the most recent value of the data item.
- Cache consistency defines the behavior of reads and writes with respect to different memory locations. It determines when a written value will be returned by a read.

Conditions for coherence
A read by processor P of location X:
1. That follows a write by P to X, with no other processor writing to X between the write and the read executed by P, should return the value written by P.
2. That follows a write by another processor Q to X should return the value written by Q, provided that there is sufficient time between the write and the read operations, and there is no other write to X.
Writes to the same location are serialized: if P and Q write to X in this order, no processor may see the value written by Q before it sees the value written by P.


[Figure: processors may see different values of the same memory location through their caches.]


Enforcing coherence
Coherent caches provide:
- Migration: movement of data.
- Replication: multiple copies of data.
Cache coherence protocols:
- Directory-based: the sharing status of each block is kept in the directory.
- Snooping: each core tracks the sharing status of each block.

Directory-based cache coherence protocols
- All information about the blocks is kept in the directory.
- SMP multiprocessors: one centralized directory, located
  1. in the outermost cache, for multi-core systems, or
  2. in main memory.
- DSM multiprocessors: distributed directory; more complex. Each node maintains a directory which tracks the sharing information of every cache line in the node.

Communication between private and shared caches
- Multi-core processor: a bus connects the private L1 and L2 instruction (I) and data (D) caches to the shared L3 cache.
- To invalidate a cached item, the processor changing the value must first acquire the bus and then place the address of the item to be invalidated on the bus.
- DSM: locating the value of an item is harder for write-back caches, because the current value of the item can be in the local cache of another processor.

Snoopy coherence protocols
Two strategies:
1. Write invalidate: on a write, invalidate all other copies. Used in modern microprocessors. Example: a write-back cache after read misses of item X by processors A and B; once A writes X, it invalidates B's cached copy of X.
2. Write update (or write broadcast): update all cached copies of a data item when the item is written. Consumes more bandwidth, thus not used in recent multiprocessors.

Implementation of cache invalidate
- All processors snoop on the bus.
- To invalidate, the processor changing an item acquires the bus and broadcasts the address of the item.
- If two processors attempt to change an item at the same time, the bus arbiter allows access to only one of them.
How to find the most recent value of a data item:
- Write-through cache: the value is in memory, but write buffers could complicate the scenario.
- Write-back cache: a harder problem; the item could be in the private cache of another processor.
A cache block has extra state bits:
- Valid bit: indicates whether the block is valid.
- Dirty bit: indicates whether the block has been modified.
- Shared bit: the cache block is shared with other processors.

MSI – Modified, Shared, Invalid protocol
Each core of a multi-core (or each CPU) runs a cache controller which implements a finite-state machine. It responds to requests coming from two sources:
1. the local core;
2. the bus (or other broadcast network connecting caches and memory).
The states of a cache block:
1. Invalid: another core has modified the block.
2. Shared: the block is shared with other cores.
3. Modified: the block in the private cache has been updated by the local core.
When an item in a block is referenced (read or write):
- Hit: the block is available.
- Miss: the block is not available.
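For concreteness, here is a minimal sketch of the MSI transitions as a C state function. The event names are illustrative; a real controller must also arbitrate for the bus, supply and write back dirty data, and handle races.

```c
/* Minimal sketch of MSI transitions for one cache block. */
typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;

typedef enum {
    CPU_READ, CPU_WRITE,            /* requests from the local core */
    BUS_READ_MISS, BUS_WRITE_MISS   /* requests snooped on the bus  */
} msi_event_t;

msi_state_t msi_next_state(msi_state_t s, msi_event_t e) {
    switch (e) {
    case CPU_READ:
        /* A read miss fetches the block; we then hold it shared. */
        return (s == MODIFIED) ? MODIFIED : SHARED;
    case CPU_WRITE:
        /* Writing requires exclusivity: broadcast an invalidate,
           then the block is modified in this cache only. */
        return MODIFIED;
    case BUS_READ_MISS:
        /* Another core reads the block: if we hold it modified we
           must supply the data and demote ourselves to shared. */
        return (s == INVALID) ? INVALID : SHARED;
    case BUS_WRITE_MISS:
        /* Another core writes the block: our copy becomes stale. */
        return INVALID;
    }
    return s;
}
```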


[Figure: MSI snoopy coherence protocol – state transition diagram.]

[Figure: MSI snoopy coherence protocol – state transition diagram (cont'd).]

Problem
How can the snooping protocol with the state diagram shown above be changed for a write-through cache? What major hardware functionality is not needed with a write-through cache compared with a write-back cache?

Solution
- A write to a block in the valid or shared state causes a write-invalidate broadcast to flush the block from other caches and a move to an exclusive state.
- We leave the exclusive state through either an invalidate from another processor or a read miss generated by the CPU when a block is displaced from the cache by another block.
- We move out of the shared state only on a write from the CPU or an invalidate from another processor.

Solution (cont'd)
- When another processor writes a block that is resident in our cache, we unconditionally invalidate the corresponding block in our cache. This ensures that the next time we read the data, we will load the updated value of the block from memory.
- Whenever the bus sees a read miss, it must change the state of an exclusive block to shared, as the block is no longer exclusive to a single cache.

Solution (cont'd)
- It is not possible for valid cache blocks to be incoherent with respect to main memory in a system with write-through caches.
- The major change introduced in moving from a write-back to a write-through cache is the elimination of the need to access dirty blocks in another processor's caches.
- With the write-through protocol it is no longer necessary to provide the hardware to force a write-back on read accesses or to abort pending memory accesses.
- As memory is updated on every write with a write-through cache, a processor that generates a read miss will always retrieve the correct information from memory.

Extensions to the MSI protocol
Complications for the basic MSI protocol:
- Operations are not atomic, e.g., detect miss, acquire bus, receive a response. This creates the possibility of deadlock and races.
- One solution: the processor that sends an invalidate can hold the bus until the other processors receive the invalidate.
Extensions:
- MESI protocol: adds an Exclusive state to indicate when a clean block is resident in only one cache. Prevents the need to send an invalidate on a write.
- MOESI protocol: adds an Owned state, indicating that the block is owned by that cache and is out-of-date in memory.


Coherence protocols: extensions
The shared memory bus and the snooping bandwidth are the bottleneck for scaling symmetric multiprocessors. Remedies:
- Duplicating tags.
- Placing the directory in the outermost cache.
- Using crossbars or point-to-point networks with banked memory.

Coherence protocols: AMD Opteron
- Memory is directly connected to each multicore chip in a NUMA-like organization.
- Implements the coherence protocol using point-to-point links.
- Uses explicit acknowledgements to order operations.

True and false sharing misses
Coherence influences the cache miss rate through coherence misses:
- True sharing misses: a write to a shared block (transmission of invalidation); a read of an invalidated block.
- False sharing misses: a block is invalidated because some word in the block, other than the one read, was written to. A subsequent reference to the block causes a miss.

Problem
x1 and x2 are in the same block, which is in the shared state in the caches of P1 and P2. Prior to time step 1, x1 was read by P2. Identify the true and false sharing misses in each of the five steps.
[Table: the five time steps of accesses by P1 and P2.]

Problem
- How can the code of an application be changed to avoid false sharing?
- What can be done by a compiler, and what requires programmer directives?

Solution
- False sharing occurs when the data object size is smaller than the granularity of cache-block valid-bit coverage and more than one data object is stored in the same cache block frame in memory.
- Two ways to prevent false sharing:
  1. Changing the cache block size or the amount of the cache block covered by a given valid bit; these are hardware changes and will not be discussed.
  2. Software solution: allocate data objects so that only one truly shared object occurs per cache block frame in memory, and no non-shared objects are located in the same cache block frame as any shared object. If this is done, then even with just a single valid bit per cache block, false sharing is impossible.
- Shared, read-only objects can be combined in a single cache block without contributing to the false sharing problem, because such a cache block can be held by many caches and accessed as needed without invalidations causing unnecessary cache misses.

Solution (cont'd)
- If shared data objects are explicitly identified in the program source code, then the compiler should, with knowledge of memory hierarchy details, be able to avoid placing more than one such object in a cache block frame in memory. If shared objects are not declared, then programmer directives may need to be added to the program.
- The remainder of the cache block frame should not contain data that would cause false sharing misses. The sure solution is to pad the block with non-referenced locations.
- Padding a cache block frame containing a shared data object with unused memory locations may lead to rather inefficient use of memory space. A cache block may contain a shared object plus objects that are read-only, as a trade-off between memory-use efficiency and incurring some false-sharing misses. This optimization almost certainly requires programmer analysis to determine whether it is worthwhile. Careful attention to data distribution with respect to cache lines, and to partitioning the computation across processors, is needed.
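A minimal C illustration of the padding fix, assuming a 64-byte cache line (an assumption for illustration, not from the slides):

```c
#include <stdalign.h>

/* Two counters, each updated by a different thread. In the first
   layout they share one 64-byte cache line, so every write by one
   thread invalidates the other thread's copy: false sharing. In the
   second layout each counter sits in its own line. */

struct counters_false_sharing {
    long a;               /* updated by thread 1                     */
    long b;               /* updated by thread 2: same line as 'a'   */
};

struct counters_padded {
    alignas(64) long a;   /* own cache line                          */
    alignas(64) long b;   /* own cache line                          */
};
```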

Types of cache misses: the three C's
1. Compulsory: occur on the first access to a block; the block must be brought into the cache. Also called cold-start misses or first-reference misses.
2. Capacity: blocks are discarded from the cache because the cache cannot contain all the blocks needed for program execution (the working set is much larger than the cache capacity).
3. Conflict: also called collision misses or interference misses; occur when several blocks are mapped to the same set or block frame under set-associative or direct-mapped block placement strategies.

3. SMP performance – the workload
1. OLTP (on-line transaction processing): client processes generate requests and servers process them; Oracle database. Server processes consume 85% of user time and block for I/O after about 25,000 instructions.
2. DSS (decision support system): 6 queries averaging about 1.5 million instructions before blocking; Oracle database.
3. AltaVista (a Web search engine): 200 GB database.


OLTP – the effect of L3 cache size
Contributions to cache misses:
1. Instruction execution
2. L2/L3 cache access
3. Memory access
4. PAL code (instructions executed in kernel mode)
Execution time improves as the L3 cache size grows from 1 to 2 MB. The idle time grows as the cache size increases: fewer memory stalls occur, and more processors are needed to cover the I/O bandwidth.

OLTP – factors contributing to the L3 miss rate
Memory access cycles contributing to the L3 miss rate:
1. Instruction: decreases as the L3 cache grows.
2. Capacity/conflict: decreases as the L3 cache grows.
3. Compulsory: almost constant.
4. False sharing: almost constant.
5. True sharing: almost constant.

OLTP – the effect of the number of processors
With a 2 MB, two-way associative cache, the memory access cycles per instruction increase with the number of processors, and the true sharing miss rate increases.

The effect of cache block size
With a 2 MB, two-way associative cache, the true sharing misses decrease as the cache block size increases.

Andrew benchmark
- Emulates a software development environment: a parallel version of the Unix make command executed on 8 processors.
- Creates 203 processes and 787 disk requests on three different file systems.
- Runs in 5.24 seconds with 128 MB of memory; no paging.


Problem
Compare the three approaches for performance evaluation of multiprocessor systems:
1. Analytical modelling: use mathematical expressions to model the behavior of the system.
2. Trace-driven simulation: run applications on a real machine and generate a file of relevant events; the traces are then replayed using cache simulators when parameters of the system are changed.
3. Execution-driven simulation: simulate the entire execution, maintaining an equivalent structure for the processor state.

Solution
Analytical models:
- Can be used to derive high-level insight into the behavior of the system in a very short time.
- The biggest challenge is determining the values of the parameters.
- While the results from an analytical model can give a good approximation of the relative trends to expect, there may be significant errors in the absolute predictions.

Solution (cont'd)
Trace-driven simulations:
- Typically more accurate than analytical models. This approach can be fairly accurate when focusing on specific components of the system (e.g., the cache system, the memory system).
- Need more time to produce results.
- Do not model the impact of aggressive processors (mispredicted paths) and may not model the actual order of accesses under reordering.
- Traces can be very large, often taking gigabytes of storage, and determining a sufficient trace length for trustworthy results is important.
- It is hard to generate representative traces from one class of machines that will be valid for all classes of simulated machines.
- It is hard to model synchronization without abstracting the synchronization in the traces to its high-level primitives.

Solution (cont'd)
Execution-driven simulation:
1. Models all the system components in detail and is consequently the most accurate of the three approaches.
2. The speed of simulation is much slower than that of the other models.
3. In some cases, the extra detail may not be necessary for the particular design parameter of interest.

Problem
Devise a multiprocessor/cluster benchmark whose performance gets worse as processors are added.

Solution
- Create the benchmark such that all processors continually update the same variable, or a small group of variables, after very little computation. A sketch follows below.
- For a multiprocessor, the miss rate and the continuous invalidates between the accesses may contribute more to the execution time than the actual computation, and adding more CPUs could slow the overall execution time.
- For a cluster organized as a ring, the communication costs needed to update the common variables could lead to inverse linear speedup behavior as more processors are added.
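A minimal sketch of such a pathological benchmark in C, assuming POSIX threads (compile with -pthread); every thread hammers one shared counter, so coherence traffic grows with the thread count:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define UPDATES 10000000L

static atomic_long shared_counter;

static void *worker(void *arg) {
    (void)arg;
    for (long i = 0; i < UPDATES; i++)
        atomic_fetch_add(&shared_counter, 1); /* ping-pongs the cache line */
    return NULL;
}

int main(int argc, char **argv) {
    int n = (argc > 1) ? atoi(argv[1]) : 4;   /* number of threads */
    pthread_t *t = malloc(n * sizeof *t);
    for (int i = 0; i < n; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < n; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", atomic_load(&shared_counter));
    free(t);
    return 0;
}
```

Timing this program for increasing thread counts should show total runtime growing rather than shrinking, since every update invalidates every other core's copy of the line.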

4. DSM – distributed shared memory
- Cache snooping is not scalable: the bandwidth of the interconnection network is insufficient as the number of processors increases.
  Example: four 4-core processors running at 4 GHz, able to sustain one reference per clock cycle. Most bus traffic is cache coherence traffic, so increasing the cache size does not help!! The required bus bandwidth is 170 GB/sec, far beyond the 4 GB/sec a modern bus can accommodate.
- Distributed directory: one entry per memory block. Amount of information: (number of memory blocks) × (number of nodes).
- A directory-based coherence protocol must handle:
  1. a read miss;
  2. a write to a shared, clean cache block.
  A write miss to a shared block is a combination of (1) and (2).

Directory protocols
- The directory keeps track of every block: which caches hold each block, and the dirty status of each block.
- Implemented in the shared L3 cache: keep a bit vector of size equal to the number of cores for each block in L3. Not scalable beyond a shared L3.
- Alternatively, implemented in a distributed fashion.

Directory protocols (cont'd)
For each block, maintain its state:
1. Shared: one or more nodes have the block cached; the value in memory is up-to-date.
2. Uncached: no node has a copy of the block.
3. Modified: exactly one node has a copy of the block; the value in memory is out-of-date. Need the owner ID.
- Need to track which nodes have copies of every block. The directory maintains block states and sends invalidation messages.
- To keep track of which nodes have copies of a block: for every memory block, a bit vector with one bit per node. It can also identify the owner of each block. (A sketch of such a directory entry follows.)
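Here is a minimal C sketch of one directory entry, assuming at most 64 nodes so the sharing set fits in a 64-bit bit vector; the names and the simplifications are illustrative, not from the slides.

```c
#include <stdint.h>

typedef enum { UNCACHED, SHARED_ST, MODIFIED_ST } dir_state_t;

typedef struct {
    dir_state_t state;
    uint64_t    sharers;  /* bit i set => node i caches the block */
    int         owner;    /* meaningful only in MODIFIED_ST       */
} dir_entry_t;

/* Read miss from 'node': the requestor joins the sharing set.
   (If the block was modified, the real protocol first fetches the
   data from the owner and writes it back to memory; omitted here.) */
void dir_read_miss(dir_entry_t *e, int node) {
    if (e->state == MODIFIED_ST)
        e->sharers = (uint64_t)1 << e->owner; /* old owner stays a sharer */
    e->sharers |= (uint64_t)1 << node;
    e->state = SHARED_ST;
}

/* Write miss from 'node': every other sharer must be invalidated,
   and the requestor becomes the exclusive owner. */
void dir_write_miss(dir_entry_t *e, int node) {
    /* invalidate messages to all set bits except 'node' go here */
    e->sharers = (uint64_t)1 << node;
    e->owner   = node;
    e->state   = MODIFIED_ST;
}
```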

Messages
[Table: messages exchanged in a directory-based coherence protocol.]

Directory protocols – cache block state transitions
State transitions for an individual cache block. Requests come from:
1. the local processor (black);
2. the home directory (gray).
The states are similar to those in the snoopy case, but explicit invalidate and write-back requests replace write misses. An attempt to write a shared cache block is treated as a miss.

Directory protocols – handling requests
For an uncached block:
- Read miss: the requesting node is sent the requested data and is made the only sharing node; the block is now shared.
- Write miss: the requesting node is sent the requested data and becomes the sharing node; the block is now exclusive.
For a shared block:
- Read miss: the requesting node is sent the requested data from memory, and the node is added to the sharing set.
- Write miss: the requesting node is sent the value; all nodes in the sharing set are sent invalidate messages; the sharing set then contains only the requesting node; the block is now exclusive.
For an exclusive block:
- Read miss: the owner is sent a data fetch message and the block becomes shared; the owner sends the data to the directory, where it is written back to memory; the sharing set contains the old owner and the requestor.
- Data write-back: the block becomes uncached; the sharing set is empty.
- Write miss: a message is sent to the old owner to invalidate its copy and send the value to the directory; the requestor becomes the new owner; the block remains exclusive.


5. Synchronization
- Synchronization is necessary for the coordination of complex activities and for multi-threading.
- Atomic actions: actions that cannot be interrupted.
- Locks: mechanisms to protect a critical section, code that can be executed by only one thread at a time.
- Hardware support for locks.
- Thread coordination: multiple threads of a thread group need to act in concert.
- Mutual exclusion: only one thread at a time should be allowed to perform an action.
- Deadlocks.
- Priority inversion.

Atomic actions
- Special precautions must be taken when handling shared resources.
- Atomic operation: a multi-step operation that should be allowed to proceed to completion without any interruption, and that should not expose the state of the system until the action is completed.
- Hiding the internal state of an atomic action reduces the number of states a system can be in; thus, it simplifies the design and maintenance of the system.
- Atomicity requires hardware support (see the sketch below):
  - Test-and-set: an instruction which writes to a memory location and returns the old content of that memory cell, non-interruptibly.
  - Compare-and-swap: an instruction which compares the contents of a memory location to a given value and, only if the two values are the same, modifies the contents of that memory location to a given new value.
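The semantics of the two primitives can be expressed with C11 atomics; this is a sketch of what the hardware instructions do, not the slides' own code:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Test-and-set: atomically store 'true' and return the old value. */
static bool test_and_set(atomic_bool *flag) {
    return atomic_exchange(flag, true);
}

/* Compare-and-swap: if *loc == expected, set *loc = desired; returns
   true on success, false if another thread changed *loc first. */
static bool compare_and_swap(atomic_int *loc, int expected, int desired) {
    return atomic_compare_exchange_strong(loc, &expected, desired);
}

/* Example use of CAS: a lock-free increment that retries until no
   other thread modified the counter between the read and the swap. */
static void lock_free_increment(atomic_int *counter) {
    int old;
    do {
        old = atomic_load(counter);
    } while (!compare_and_swap(counter, old, old + 1));
}
```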

All-or-nothing atomicity
- Either the entire atomic action is carried out, or the system is left in the same state it was in before the atomic action was attempted; a transaction is either carried out successfully, or the record targeted by the transaction is returned to its original state.
- Two phases:
  - Pre-commit: during this phase it should be possible to back out without leaving any trace. All steps necessary to prepare the post-commit phase (e.g., check permissions, swap into main memory all pages that may be needed, mount removable media, allocate stack space) must be carried out; no results should be exposed and no irreversible actions should be taken.
  - Commit point: the transition from the first phase to the second.
  - Post-commit: this phase should be able to run to completion. Shared resources allocated during the pre-commit phase cannot be released until after the commit point.

[Figure: the states of an all-or-nothing action – a new action is Pending; from Pending it moves to Committed on commit, to Aborted on abort, or is Discarded.]

Before-or-after atomicity
The effect of multiple actions is as if these actions occurred one after another, in some order.

Locks
- Locks: shared variables which act as flags to coordinate access to shared data. Manipulated with two primitives: ACQUIRE and RELEASE.
- Locks support the implementation of before-or-after actions; only one thread can acquire the lock, and the others have to wait.
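A minimal sketch of ACQUIRE and RELEASE as a spin lock built on test-and-set, in C11; the names and the example critical section are illustrative assumptions:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* ACQUIRE spins until atomic_exchange returns 'false', i.e., the
   lock was free and this thread just took it. */
typedef struct { atomic_bool held; } spinlock_t;

static void ACQUIRE(spinlock_t *l) {
    while (atomic_exchange(&l->held, true))
        ;  /* another thread holds the lock: busy-wait */
}

static void RELEASE(spinlock_t *l) {
    atomic_store(&l->held, false);
}

/* Usage: a before-or-after action on a shared balance. */
static spinlock_t balance_lock;
static long balance;

void deposit(long amount) {
    ACQUIRE(&balance_lock);
    balance += amount;     /* critical section: one thread at a time */
    RELEASE(&balance_lock);
}
```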
