Piranha - Barroso


Piranha: Designing a Scalable CMP-based System for Commercial Workloads
Luiz André Barroso
Western Research Laboratory
April 27, 2001
Asilomar Microcomputer Workshop

What is Piranha?
• A scalable shared memory architecture based on chip multiprocessing (CMP) and targeted at commercial workloads
• A research prototype under development by Compaq Research and Compaq NonStop Hardware Development Group
• A departure from ever increasing processor complexity and system design/verification cycles

Importance of Commercial Applications
[Pie chart: Worldwide Server Customer Spending (IDC 1999). Legible segments include Decision support 14% and Business processing 22%.]
• Total server market size in 1999: $55-60B
  - technical applications: less than $6B
  - commercial applications: $40B

Price Structure of Servers
• IBM eServer 680 (220KtpmC; $43/tpmC)
  - 24 CPUs
  - 96GB DRAM, 18 TB Disk
  - $9M price tag
• Compaq ProLiant ML370 (32KtpmC; $12/tpmC)
  - 4 CPUs
  - 8GB DRAM, 2TB Disk
  - $240K price tag
[Chart: Normalized breakdown of HW cost, IBM eServer 680 vs. Compaq ProLiant ML570]

Price per component
System                   $/CPU    $/MB DRAM   $/GB Disk
IBM eServer 680          65,417   9           359
Compaq ProLiant ML570    6,048    4           64

- Storage prices dominate (50%-70% in customer installations)
- Software maintenance/management costs even higher (up to $100M)
- Price of expensive CPUs/memory system amortized

Outline
• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
• Design Methodology
• Summary

Studies of Commercial Workloads
• Collaboration with Kourosh Gharachorloo (Compaq WRL)
  - ISCA'98: Memory System Characterization of Commercial Workloads (with E. Bugnion)
  - ISCA'98: An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors (with J. Lo, S. Eggers, H. Levy, and S. Parekh)
  - ASPLOS'98: Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors (with P. Ranganathan and S. Adve)
  - HPCA'00: Impact of Chip-Level Integration on Performance of OLTP Workloads (with A. Nowatzyk and B. Verghese)
  - ISCA'01: Code Layout Optimizations for Transaction Processing Workloads (with A. Ramirez, R. Cohn, J. Larriba-Pey, G. Lowney, and M. Valero)

Studies of Commercial Workloads: summary
• Memory system is the main bottleneck
  - astronomically high CPI
  - dominated by memory stall times
  - instruction stalls as important as data stalls
  - fast/large L2 caches are critical
• Very poor Instruction Level Parallelism (ILP)
  - frequent hard-to-predict branches
  - large L1 miss ratios
  - Ld-Ld dependencies
  - disappointing gains from wide-issue out-of-order techniques!

Outline
• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
• Design Methodology
• Summary

Increasing Complexity of Processor Designs
• Pushing limits of instruction-level parallelism
  - multiple instruction issue
  - speculative out-of-order (OOO) execution
• Driven by applications such as SPEC
• Increasing design time and team size

Three successive SGI processors (names not legible in the transcription):
Transistor Count (millions)            0.10   1.40   6.80
Design Team Size                       20     55     100
Design Time (months)                   15     24     36
Verification Team Size (% of total)    15%    20%    35%
courtesy: John Hennessy, IEEE Computer, 32(8)

• Yielding diminishing returns in performance

Exploiting Higher Levels of Integration
[Diagram: Alpha 21364 single chip containing a 1GHz 21264 CPU, 64KB I and 64KB D caches, 1.5MB L2, Coherence Engine, Network Interface, and two MEM-CTL blocks; a multiprocessor built from 364 chips, each with its own memory (M) and I/O (IO)]
• Single chip
  - lower latency, higher bandwidth
  - reuse of existing CPU core
  - addresses complexity issues
• Incrementally scalable glueless multiprocessing

Exploiting Parallelism in Commercial Apps
Simultaneous Multithreading (SMT)
Example: Alpha 21464
[Diagram: threads 1 to 4 sharing one CPU over time]
Chip Multiprocessing (CMP)
Example: IBM Power4
[Diagram: two CPUs, each with I and D caches, sharing an L2, with MEM-CTL, Coherence, Network, and I/O blocks]
• SMT superior in single-thread performance
• CMP addresses complexity by using simpler cores

Outline
• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
  - Architecture
  - Performance
• Design Methodology
• Summary

Piranha Project
• Explore chip multiprocessing for scalable servers
• Focus on parallel commercial workloads
• Small team, modest investment, short design time
• Address complexity by using:
  - simple processor cores
  - standard ASIC methodology
Give up on ILP, embrace TLP

Piranha Team Members
Research
  - Luiz André Barroso (WRL)
  - Kourosh Gharachorloo (WRL)
  - David Lowell (WRL)
  - Joel McCormack (WRL)
  - Mosur Ravishankar (WRL)
  - Rob Stets (WRL)
  - Yuan Yu (SRC)
NonStop Hardware Development ASIC Design Center
  - Tom Heynemann
  - Dan Joyce
  - Harland Maxwell
  - Harold Miller
  - Sanjay Singh
  - Scott Smith
  - Jeff Sprouse
  - several contractors
Former Contributors
  Robert McNamara
  Basem Nayfeh
  Andreas Nowatzyk
  Joan Pendleton
  Shaz Qadeer
  Brian Robinson
  Barton Sano
  Daniel Scales
  Ben Verghese

Piranha Processing Node
[Diagram: single chip with eight CPUs, each with I and D caches; eight L2 banks, each with a MEM-CTL; the intra-chip switch (ICS); protocol engines HE and RE; and a Router]
Alpha core: 1-issue, in-order, 500MHz
L1 caches: I&D, 64KB, 2-way
Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
L2 cache: shared, 1MB, 8-way
Memory Controller (MC): RDRAM, 12.8GB/sec
Protocol Engines (HE & RE): µprog., 1K µinstr., even/odd interleaving
System Interconnect: 4-port Xbar router, topology independent, 32GB/sec total bandwidth

Piranha I/O Node
[Diagram: single chip with a Router (2 links @ 8GB/s), HE and RE, one CPU with I and D caches, L2, MEM-CTL, ICS, FB blocks, and a PCI-X interface]
• I/O node is a full-fledged member of system interconnect
  - CPU indistinguishable from Processing Node CPUs
  - participates in global coherence protocol

Example Configuration
[Diagram: interconnected Processing nodes (P) and I/O nodes (P-I/O)]
• Arbitrary topologies
• Match ratio of Processing to I/O nodes to application requirements

L2 Cache and Intra-Node Coherence
• No inclusion between L1s and L2 cache
  - total L1 capacity equals L2 capacity
  - L2 misses go directly to L1
  - L2 filled by L1 replacements
• L2 keeps track of all lines in the chip
  - sends Invalidates, Forwards
  - orchestrates L1-to-L2 write-backs to maximize chip-memory utilization
  - cooperates with Protocol Engines to enforce system-wide coherence
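The fill policy on this slide can be sketched as a toy model in Python. This is an illustration under stated assumptions, not the Piranha implementation: it models a single L1 rather than one I and one D cache per CPU, treats both caches as fully associative with LRU replacement, and assumes a line leaves the L2 when it moves up to the L1. The class and method names are hypothetical. The capacity check uses the figures from the Processing Node slide (eight CPUs, 64KB I and 64KB D caches each, 1MB L2).

```python
from collections import OrderedDict

# Capacity check with the figures from the Processing Node slide.
NUM_CPUS = 8
L1_KB_PER_CACHE = 64          # one I cache and one D cache per CPU
L2_KB = 1024                  # 1MB shared L2
TOTAL_L1_KB = NUM_CPUS * 2 * L1_KB_PER_CACHE


class NonInclusiveHierarchy:
    """Toy model: one L1 and one L2, both fully associative LRU (assumption)."""

    def __init__(self, l1_lines, l2_lines):
        self.l1 = OrderedDict()
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines

    def access(self, addr):
        if addr in self.l1:
            self.l1.move_to_end(addr)      # refresh LRU position
            return "L1 hit"
        if addr in self.l2:
            del self.l2[addr]              # assumption: no duplicate is kept in L2
            self._fill_l1(addr)
            return "L2 hit"
        self._fill_l1(addr)                # data from memory goes directly to L1
        return "miss"

    def _fill_l1(self, addr):
        if len(self.l1) >= self.l1_lines:
            victim, _ = self.l1.popitem(last=False)
            self._fill_l2(victim)          # L2 is filled only by L1 replacements
        self.l1[addr] = True

    def _fill_l2(self, addr):
        if len(self.l2) >= self.l2_lines:
            self.l2.popitem(last=False)    # oldest L2 line is dropped
        self.l2[addr] = True
```

With a two-line L1, accessing lines 1, 2, 3 misses three times and pushes line 1 into the L2; a later access to line 1 is then served by the L2, and line 2 becomes the next victim.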

Inter-Node Coherence Protocol
• 'Stealing' ECC bits for memory directory

  Organization (data + ECC + directory bits)   Total directory bits
  8 x (64 + 8)                                 0
  4 x (128 + 9 + 7)                            28
  2 x (256 + 10 + 22)                          44
  1 x (512 + 11 + 53)                          53

• Directory (2b state + 40b sharing info)
  [ state 2b | info on sharers 20b | state 2b | info on sharers 20b ]
• Dual representation: limited pointer + coarse vector
• "Cruise Missile" Invalidations (CMI)
  - limit fan-out/fan-in serialization with CV
  [Diagram: CMI example with sharing vector 010000001000]
• Several new protocol optimizations
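The bit counts in the ECC organizations above can be checked with a short calculation. The sketch below assumes the ECC is a standard SECDED Hamming code (the slide does not name the code, but its check-bit counts of 8, 9, 10, and 11 match SECDED for 64, 128, 256, and 512 data bits) and that the ECC storage budget per 512-bit line stays fixed at 8 x 8 = 64 bits. The function names are hypothetical.

```python
def secded_check_bits(data_bits):
    """Check bits for a SECDED Hamming code over data_bits of data."""
    r = 1
    # Single-error correction needs 2**r >= data_bits + r + 1.
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one extra parity bit for double-error detection


LINE_BITS = 512                              # 64-byte line, as on the slide
ECC_BUDGET = 8 * secded_check_bits(64)       # 8 x 8 = 64 bits of ECC storage


def free_directory_bits(chunks):
    """Bits left for the directory when ECC is computed over fewer, wider chunks."""
    chunk_bits = LINE_BITS // chunks
    return ECC_BUDGET - chunks * secded_check_bits(chunk_bits)


for chunks in (8, 4, 2, 1):
    print(chunks, LINE_BITS // chunks, secded_check_bits(LINE_BITS // chunks),
          free_directory_bits(chunks))
```

Under these assumptions the 2 x 256 organization frees 44 bits, which is the size of the directory entry shown on the slide (two groups of 2 state bits plus 20 sharer bits).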

Simulated Architectures

Single-Chip Piranha Performance
[Bar charts: Normalized Execution Time for OLTP and DSS on four configurations: P1 (500 MHz, 1-issue), INO (1GHz, 1-issue), OOO (1GHz, 4-issue), P8 (500MHz, 1-issue)]
• Piranha's performance margin 3x for OLTP and 2.2x for DSS
• Piranha has more outstanding misses => better utilizes memory system

Single-Chip Performance (Cont.)
[Chart: Speedup vs. Number of Cores (1 to 8), 500 MHz, 1-issue]
[Chart: Normalized Breakdown of L1 Misses (%) into L2 Hit, L2 Fwd, and L2 Miss for P1, P2, P4, P8]
• Near-linear scalability
  - low memory latencies
  - effectiveness of highly associative L2 and non-inclusive caching

Potential of a Full-Custom Piranha
[Bar charts: Normalized Execution Time for OLTP and DSS, including P8F (1.25GHz, 1-issue)]
• 5x margin over OOO for OLTP and DSS
• Full-custom design benefits substantially from boost in core speed

Outline
• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
• Design Methodology
• Summary

Managing Complexity in the Architecture
• Use of many simpler logic modules
  - shorter design
  - easier verification
  - only short wires*
  - faster synthesis
  - simpler chip-level layout
• Simplify intra-chip communication
  - all traffic goes through ICS (no backdoors)
• Use of microprogrammed protocol engines
• Adoption of large VM pages
• Implement sub-set of Alpha ISA
  - no VAX floating point, no multimedia instructions, etc.

Methodology Challenges
• Isolated sub-module testing
  - need to create robust bus functional models (BFM)
  - sub-modules' behavior highly inter-dependent
  - not feasible with a small team
• System-level (integrated) testing
  - much easier to create tests
  - only one BFM at the processor interface
  - simpler to assert correct operation
  - Verilog simulation is too slow for comprehensive testing

Our Approach:
• Design in stylized C++ (synthesizable RTL level)
  - use mostly system-level, semi-random testing
  - simulations in C++ (faster & cheaper than Verilog)
    - simulation speed 1000 clocks/second
  - employ directed tests to fill test coverage gaps
• Automatic C++ to Verilog translation
  - single design database
  - reduce translation errors
  - faster turnaround of design changes
  - risk: untested methodology
• Using industry-standard synthesis tools
• IBM ASIC process (Cu11)

Piranha Methodology: Overview
[Flow diagram connecting C++ RTL Models, cxx, CLevel, Verilog Models, PS1, PS1V, and Physical Design]
C++ RTL Models: Cycle accurate and "synthesizable"
PS1: Fast (C++) Logic Simulator
Verilog Models: Machine translated from C++ models
Physical Design: leverages industry standard Verilog-based tools
cxx: C++ compiler
CLevel: C++-to-Verilog Translator
PS1V: Can "co-simulate" C++ and Verilog module versions and check correspondence
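The correspondence check attributed to PS1V can be illustrated with a small sketch: drive two versions of the same module with identical semi-random stimulus and stop at the first cycle where their states differ. Everything here is hypothetical and written in Python for brevity. The 4-bit counter, the function names, and the 5% reset rate are illustrative choices, and the real tools compare a hand-written model against its machine-translated Verilog rather than two Python functions.

```python
import random


def reference_counter(state, enable, reset):
    """Hypothetical hand-written model: 4-bit counter with enable and reset."""
    if reset:
        return 0
    return (state + 1) % 16 if enable else state


def translated_counter(state, enable, reset):
    """Stand-in for the translated version: same behaviour, written with bit operations."""
    nxt = (state + (enable & 1)) & 0xF
    return 0 if reset else nxt


def cosimulate(model_a, model_b, cycles, seed=0):
    """Return the first cycle at which the two models disagree, or None."""
    rng = random.Random(seed)          # fixed seed keeps a failing run reproducible
    a = b = 0
    for cycle in range(cycles):
        enable = rng.randint(0, 1)
        reset = 1 if rng.random() < 0.05 else 0
        a = model_a(a, enable, reset)
        b = model_b(b, enable, reset)
        if a != b:
            return cycle
    return None


mismatch = cosimulate(reference_counter, translated_counter, cycles=10_000)
print("models agree" if mismatch is None else f"first mismatch at cycle {mismatch}")
```

Seeding the generator is the design choice that matters here: a random test that finds a mismatch is only useful if the same stimulus can be replayed while debugging.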

Summary
• CMP architectures are inevitable in the near future
• Piranha investigates an extreme point in CMP design
  - many simple cores
• Piranha has a large architectural advantage over complex single-core designs (3x) for database applications
• Piranha methodology enables faster design turnaround
• Key to Piranha is application focus:
  - One-size-fits-all solutions may soon be infeasible

Reference
• Papers on commercial workload performance & Piranha
  research.compaq.com/wrl/projects/Database
