Multiprocessor Operating Systems


Multiprocessor Operating Systems
CSC 256/456, Dept. of Computer Science, University of Rochester (12/4/2018)

Multiprocessor Hardware
 A computer system in which two or more CPUs share full access to the main memory
 Each CPU might have its own cache, and coherence among the multiple caches is maintained
– a write operation by one CPU is visible to all other CPUs
– writes to the same location are seen in the same order by all CPUs (also called write serialization)
– implemented by bus snooping and cache invalidation
[Figure: several CPUs, each with its own cache, connected over a memory bus to the shared memory]

Single-Processor OS vs. Multiprocessor OS
 Single-processor OS
– easier to support kernel synchronization: coarse-grained vs. fine-grained locking; disabling interrupts to prevent concurrent executions
– easier to perform scheduling: decide which task to run, not where to run it
 Multiprocessor OS
– evolution of OS structure
– synchronization
– scheduling

Multiprocessor Applications
 Multiprogramming
– multiple regular applications running concurrently
 Concurrent servers
– web servers, client/server applications
 Parallel programs
– utilizing multiple processors to complete one task (parallel matrix multiplication C = A x B, Gaussian elimination)
– strong synchronization
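Coherence traffic is why data layout matters even for correct programs: every store to a shared cache line can invalidate that line in all other caches. A minimal C sketch of the common padding idiom that keeps per-CPU counters on separate lines (the 64-byte line size and all names are assumptions, not from the slides):

```c
/* Write serialization keeps all CPUs' views consistent, but each store
 * may invalidate the line in every other CPU's cache.  Padding per-CPU
 * data to a full line avoids two CPUs writing the same line. */
#define CACHE_LINE 64                 /* assumed line size */

struct padded_counter {
    unsigned long value;
    char pad[CACHE_LINE - sizeof(unsigned long)];
};

/* one counter per CPU, each on its own cache line */
struct padded_counter counters[4];

unsigned long total(void)
{
    unsigned long sum = 0;
    for (int i = 0; i < 4; i++)
        sum += counters[i].value;     /* reads are cheap; only the owner writes */
    return sum;
}
```

Each CPU increments only its own slot, so no coherence invalidations are triggered by another CPU's updates.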

Multiprocessor OS – One OS per CPU
 Each CPU has its own operating system
– quick to port from a single-processor OS
 Disadvantages
– difficult to share things (processing cycles, memory, buffer cache)

Multiprocessor OS – Master/Slave
 All operating system functionality goes to one CPU
– no multiprocessor concurrency in the kernel
 Disadvantage
– OS CPU consumption may be large, so the OS CPU becomes the bottleneck (especially in a machine with many CPUs)

Multiprocessor OS – Shared OS
 A single OS instance may run on all CPUs
 The OS itself must handle multiprocessor synchronization
– multiple OS instances on multiple CPUs may access shared data structures concurrently

Preemptive Scheduling
 Use timer interrupts or signals to trigger involuntary yields
 Protect scheduler data structures by locking the ready list and disabling/re-enabling signals prior to/after rescheduling

yield:
    disable signals
    enqueue(ready list, current)
    reschedule
    re-enable signals

Synchronization (Fine/Coarse-Grain Locking)
 Fine-grain locking – lock only what is necessary for the critical section
 Coarse-grain locking – lock a large piece of code, much of which is unnecessary
– simplicity, robustness
– prevents simultaneous execution (which is not possible on a uniprocessor anyway)
 Large critical sections are good for best-case latency (low locking overhead) but bad for throughput (low parallelism)

Anderson et al. 1989 (IEEE TOCS)
 Raises issues of
– locality (per-processor data structures)
– granularity of scheduling tasks
– lock overhead
– tradeoff between throughput and latency

Performance Measures
 Latency – cost of thread management under the best-case assumption of no contention for locks
 Throughput – rate at which threads can be created, started, and finished when there is contention

Optimizations
 Allocate stacks lazily
 Store deallocated control blocks and stacks in free lists
 Create per-processor ready lists
 Create local free lists for locality
 Queue of idle processors (in addition to the queue of waiting threads)
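The fine- vs. coarse-grain tradeoff can be made concrete with a small counter table; the table and function names are illustrative, not from Anderson et al.:

```c
#include <pthread.h>

#define NBUCKETS 16
static long table[NBUCKETS];

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t bucket_lock[NBUCKETS];

void locks_init(void)
{
    for (int i = 0; i < NBUCKETS; i++)
        pthread_mutex_init(&bucket_lock[i], NULL);
}

/* coarse-grain: one lock guards the whole table -- simple and robust,
 * but every update serializes behind it */
void add_coarse(int key, long delta)
{
    pthread_mutex_lock(&big_lock);
    table[key % NBUCKETS] += delta;
    pthread_mutex_unlock(&big_lock);
}

/* fine-grain: lock only the bucket being touched -- updates to
 * different buckets proceed in parallel */
void add_fine(int key, long delta)
{
    int b = key % NBUCKETS;
    pthread_mutex_lock(&bucket_lock[b]);
    table[b] += delta;
    pthread_mutex_unlock(&bucket_lock[b]);
}
```

The coarse version trades throughput for simplicity, as the slide notes: on a uniprocessor the lost parallelism costs nothing, but on a multiprocessor the single lock becomes the bottleneck.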

Ready List Management
 Single lock for all data structures
 Multiple locks, one per data structure
 Local free lists for control blocks and stacks; a single shared, locked ready list
 Queue of idle processors, each with a preallocated control block and stack, waiting for work
 Local ready list per processor, each with its own lock

Multiprocessor Scheduling
 Timesharing
– similar to uniprocessor scheduling – one queue of ready tasks (protected by synchronization); a task is dequeued and executed when a processor is available
 Space sharing
– cache affinity: affinity-based scheduling – try to run each process on the processor it last ran on
– cache sharing and synchronization of parallel/concurrent applications: gang/cohort scheduling – utilize all CPUs for one parallel/concurrent application at a time (e.g., a web server, parallel Gaussian elimination, a client/server game such as Civ)

The Performance Transparency Challenge
 Modern multicore (SMP-CMP-SMT) systems have 10s to 100s of hardware contexts
 Resource sharing – functional units, caches, on- and off-chip interconnects, memory
 Non-uniform access latencies
[Image sources: ...iew-intel-core-i74770k-i54560k-tested/5; http://www.eecg.toronto.edu/~tamda/papers/threadclustering.pdf]
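The "local ready list per processor, each with its own lock" option can be sketched as follows; the sizes and names are illustrative, and the fallback loop is a simple form of work stealing that preserves cache affinity in the common case:

```c
#include <pthread.h>

#define NCPU 2
#define QLEN 8

struct rq {
    pthread_mutex_t lock;     /* one lock per processor's list */
    int task[QLEN];
    int head, tail;           /* empty when head == tail (demo: no wraparound) */
};
static struct rq rq[NCPU];

void rq_init(void)
{
    for (int c = 0; c < NCPU; c++)
        pthread_mutex_init(&rq[c].lock, NULL);
}

void rq_push(int cpu, int task)
{
    pthread_mutex_lock(&rq[cpu].lock);
    rq[cpu].task[rq[cpu].tail++] = task;
    pthread_mutex_unlock(&rq[cpu].lock);
}

static int rq_pop(struct rq *q)
{
    int t = -1;
    pthread_mutex_lock(&q->lock);
    if (q->head < q->tail)
        t = q->task[q->head++];
    pthread_mutex_unlock(&q->lock);
    return t;
}

/* prefer the local list (affinity); steal from others only when empty */
int next_task(int cpu)
{
    int t = rq_pop(&rq[cpu]);
    for (int c = 0; t < 0 && c < NCPU; c++)
        if (c != cpu)
            t = rq_pop(&rq[c]);
    return t;                 /* -1: nothing runnable anywhere */
}
```

With per-processor lists, lock contention is local unless a processor runs dry, which is exactly the locality argument behind this design point.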

Operating System Level Resource Management To Date
 Capitalistic – generating more requests results in more resource usage
– Performance: resource contention can result in significantly reduced overall performance
– Fairness: an equal time slice does not necessarily guarantee equal progress
– Sharing-oblivious: extraneous communication due to poor placement

Multi-Core Cache Challenges
 Hardware manages the cache at the granularity of cache lines
– single program: data with different locality are mixed together
– shared cache: uncontrolled sharing – threads compete for space, causing interference
 Using the OS as an auxiliary to manage the cache
– high-level knowledge of the program
– running state of the entire system
– how? page coloring

Address Mapping in Cache / What is a Page Color?
 The set-index bits of the physical address select the cache set
– cache size = line size * associativity (ways) * number of sets
– example: a 256 KB, 16-way, 32 B-line L2 cache has 512 sets – in a 32-bit physical address, 5 bits of line offset and 9 bits of set index
 With a 12-bit page offset, the upper set-index bits fall inside the physical page number – here 2 bits, giving 4 colors
 Color:
– in the cache: a group of cache sets
– in memory: a group of physical pages (page N, N+4, N+8, ...)
 Page color: data belonging to a page color can only be cached by cache sets of the same color
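The bit arithmetic in the example works out as follows; the constants match the slide's 256 KB, 16-way, 32 B-line cache and a 4 KB page:

```c
#include <stdint.h>

#define LINE_BITS   5   /* 32 B line        -> 5 offset bits  */
#define SET_BITS    9   /* 256K/(16*32)=512 -> 9 index bits   */
#define PAGE_BITS  12   /* 4 KB page        -> 12 offset bits */

/* color bits = (LINE_BITS + SET_BITS) - PAGE_BITS = 2 -> 4 colors */
#define COLOR_BITS (LINE_BITS + SET_BITS - PAGE_BITS)
#define NCOLORS    (1u << COLOR_BITS)

/* the color is the low COLOR_BITS of the physical page number */
unsigned page_color(uint64_t phys_addr)
{
    return (unsigned)((phys_addr >> PAGE_BITS) & (NCOLORS - 1));
}
```

Pages N and N+4 map to the same color, as the slide states, because adding 4 to the page number does not change its low 2 bits.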

OS and Page Coloring
 What is the role of the OS?
– control the mapping between virtual memory pages and physical pages via the page table
 Color a page: map a virtual page to a physical page with a particular color (the lower bits of the physical page number)
 Re-color a page: change the color at runtime
– flush the TLB, copy the page data, modify the page table
– incurs significant overhead (a function of the amount of data moved)

Software Cache Partitioning by Page Coloring – Under OS Control [EuroSys 2009]
 Partition the shared cache among co-running applications (App1/App2/App3 partitions) by giving each application physical pages of disjoint colors

Intel's Hardware-Supported Cache Partitioning (2016)
 Intel Cache Allocation Technology – partition by ways (rather than sets) by writing to registers that specify the range of ways each core can access
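A hedged sketch of the way-partitioning idea behind Cache Allocation Technology: CAT associates each class of service with a contiguous bitmask of ways, where each set bit grants one way. This helper only computes such masks; actually programming them happens through model-specific registers or Linux's resctrl interface, which is omitted here:

```c
#include <stdint.h>

/* mask granting ways [first, first + count) to one class of service;
 * CAT requires the set bits to be contiguous, which this guarantees */
uint32_t cat_way_mask(int first, int count)
{
    return ((1u << count) - 1u) << first;
}

/* two classes get isolated capacity iff their way masks do not overlap */
int cat_disjoint(uint32_t a, uint32_t b)
{
    return (a & b) == 0;
}
```

Compared with page coloring, nothing moves in memory when the masks change, which is why reconfiguration is cheap.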

Big Picture
 Control resource usage of co-running applications: page coloring or hardware throttling
 Select which applications run together: resource-aware scheduling

New Mechanism: Hardware Execution Throttling [USENIX 2009]
 Instead of directly controlling cache resource allocation, throttle the execution speed of the application that overuses the resource
 Available throttling knobs
– duty-cycle modulation: the CPU works only during duty cycles and stalls during non-duty cycles; different from dynamic voltage/frequency scaling (per-core vs. per-processor control; thermal vs. power management)
– frequency/voltage scaling
– enabling/disabling cache prefetchers
 L1 prefetchers
– IP: keeps track of the instruction pointer for load history
– DCU: when detecting multiple loads from the same line within a time limit, prefetches the next line
 L2 prefetchers
– Adjacent line: prefetches the line adjacent to the required data
– Stream: looks at streams of data for regular patterns

Comparing Hardware Execution Throttling to Page Coloring
 Kernel code modification complexity
– code length: 40 lines in a single file; as a reference, our page coloring implementation takes 700 lines of code across 10 files
 Runtime overhead of configuration
– less than 1 microsecond; as a reference, re-coloring a page takes 3 microseconds
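Duty-cycle modulation is driven per core by the IA32_CLOCK_MODULATION MSR (0x19A). The sketch below only builds the register value (an enable bit plus a duty field in eighths); writing it needs ring-0 access (e.g., /dev/cpu/N/msr on Linux), and the exact field layout should be confirmed against the Intel SDM for the target CPU:

```c
#include <stdint.h>

#define IA32_CLOCK_MODULATION 0x19A       /* MSR address (per core) */
#define CLOCK_MOD_ENABLE      (1u << 4)   /* bit 4: enable modulation */

/* eighths in 1..7: run eighths/8 of the time; anything else disables
 * throttling entirely (the assumed duty field is bits 3:1) */
uint64_t clock_mod_value(unsigned eighths)
{
    if (eighths == 0 || eighths >= 8)
        return 0;                         /* no throttling */
    return CLOCK_MOD_ENABLE | (eighths << 1);
}
```

Because this is a plain register write taking a few hundred cycles, reconfiguring the throttle is far cheaper than re-coloring pages, which is the slide's central comparison.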

Existing Mechanism (II): Scheduling Quantum Adjustment
 Shorten the time slice of the application that overuses the cache

Drawbacks of Scheduling Quantum Adjustment
 Coarse-grained control at scheduling-quantum granularity may result in fluctuating service delays for individual transactions
 May let a core idle if there is no other active thread available
[Figure: timelines for Core 0 and Core 1 under quantum adjustment – Thread A and Thread B run in alternating slices, leaving idle gaps]

Comparison of Hardware Execution Throttling to the Other Two Mechanisms
 Comparison to page coloring
– little complexity added to the kernel: 40 lines in a single file, vs. 700 lines across 10 files for our page coloring implementation
– lightweight to configure: a register read plus write takes 265-350 cycles for duty-cycle modulation and 298-2065 cycles for the prefetchers (less than 1 microsecond); re-coloring a page takes 3 microseconds
 Comparison to scheduling quantum adjustment
– more fine-grained control
[Figure: timelines contrasting quantum adjustment, where a core idles while the other thread's slice runs, with hardware execution throttling, where both threads make continuous but throttled progress]
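The idle-core drawback can be captured by a toy model (all numbers illustrative, not from the slides): a shortened quantum only costs utilization when no other thread is runnable to fill the gap.

```c
/* percent of one core kept busy when a thread's slice is shortened:
 * another runnable thread fills the gap; otherwise the core idles */
int quantum_core_busy(int slice_ms, int period_ms, int other_runnable)
{
    if (other_runnable)
        return 100;
    return slice_ms * 100 / period_ms;
}
```

Hardware throttling avoids this cliff: the throttled thread keeps the core busy at a reduced rate rather than leaving it idle.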

Policies for Hardware-Throttling-Enabled Multicore Management
 User-defined service level agreements (SLAs)
– proportional progress among competing threads (unfairness metric: coefficient of variation of the threads' performance)
– quality-of-service guarantee for high-priority application(s)
 Key challenge
– the throttling configuration space grows exponentially with the number of cores
– quickly determining optimal or close-to-optimal throttling configurations is challenging

TEMM: A Flexible Framework for Throttling-Enabled Multicore Management [ICPP'12]
 Customizable performance estimation model
– reference configuration set and linear approximation
– currently incorporates duty-cycle modulation and frequency/voltage scaling
 Iterative refinement
– prediction accuracy improves over time as more configurations are added to the reference set

