Performance Evaluation of Intel® Transactional Synchronization Extensions for High-Performance Computing

Performance Evaluation of Intel® Transactional Synchronization Extensions for High-Performance Computing

Richard M. Yoo† (richard.m.yoo@intel.com), Christopher J. Hughes† (christopher.j.hughes@intel.com), Ravi Rajwar‡ (ravi.rajwar@intel.com), Konrad Lai‡ (konrad.lai@intel.com)
†Parallel Computing Laboratory, Intel Labs, Santa Clara, CA 95054
‡Intel Architecture Development Group, Intel Architecture Group, Hillsboro, OR 97124

ABSTRACT
Intel has recently introduced Intel® Transactional Synchronization Extensions (Intel® TSX) in the Intel 4th Generation Core™ Processors. With Intel TSX, a processor can dynamically determine whether threads need to serialize through lock-protected critical sections. In this paper, we evaluate the first hardware implementation of Intel TSX using a set of high-performance computing (HPC) workloads, and demonstrate that applying Intel TSX to these workloads can provide significant performance improvements. On a set of real-world HPC workloads, applying Intel TSX provides an average speedup of 1.41x. When applied to a parallel user-level TCP/IP stack, Intel TSX provides a 1.31x average bandwidth improvement on network intensive applications. We also demonstrate the ease with which we were able to apply Intel TSX to the various workloads.

Categories and Subject Descriptors
B.8.2 [Hardware]: Performance and Reliability—performance analysis and design aids; C.1.4 [Computer Systems Organization]: Processor Architectures—parallel architectures; D.1.3 [Software]: Programming Techniques—concurrent programming

General Terms
Performance, Measurement

Keywords
Transactional Memory, High-Performance Computing

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
SC13, November 17–21, 2013, Denver, CO, USA.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2378-9/13/11.

1. INTRODUCTION
Due to limits in technology scaling, software developers have come to rely on thread-level parallelism to obtain sustainable performance improvement. However, except for the case where the computation is massively parallel (e.g., data-parallel applications), the performance of threaded applications is often limited by how inter-thread synchronization is performed. For example, using coarse-grained locks can limit scalability, since the execution of lock-guarded critical sections is inherently serialized. Using fine-grained locks, in contrast, may provide good scalability, but increases locking overheads, and can often lead to subtle bugs.
Various proposals have been made over the years to address the limitations of lock-based synchronization. Lock-free algorithms support concurrent updates to data structures and do not require mutual exclusion through a lock. However, such algorithms are very difficult to write and may not perform as well as their lock-based counterparts. Hardware transactional memory [11] and the Oklahoma Update Protocol [26] propose hardware support to simplify the implementation of lock-free data structures. They rely on mechanisms other than locks to ensure forward progress. Speculative Lock Elision [22] proposes hardware support to expose concurrency in lock-based synchronization—the hardware would optimistically execute critical sections without serialization and serialize execution only when necessary.
In spite of these proposals, writing correct, high-performance multi-threaded programs remains quite challenging.
Intel has introduced Intel® Transactional Synchronization Extensions (Intel® TSX) in the Intel 4th Generation Core™ Processors [12] to improve the performance of critical sections. With Intel TSX, the hardware can dynamically determine whether threads need to serialize through lock-protected critical sections. Threads perform serialization only if required for correct execution. Hardware can thus expose concurrency that would have been hidden due to unnecessary synchronization.
In this paper we apply Intel TSX to a set of workloads in the high-performance computing (HPC) domain and present the first evaluation of the performance benefits when running on a processor with Intel TSX support. The evaluation incorporates a broad spectrum of workloads, ranging from kernels and benchmark suites to a set of real-world workloads and a parallel user-level TCP/IP stack. Some of the workloads were originally written to stress test a throughput-oriented processor [24], and have been optimized for the HPC domain. Nevertheless, applying Intel TSX to these workloads provides an average speedup of 1.41x. Applying Intel TSX to a user-level TCP/IP stack provides an average bandwidth improvement of 1.31x on a set of network intensive applications. These results are in contrast to prior work on other commercial implementations that show little to no performance benefits [23, 29], or are limited to small kernels and benchmarks [20, 6, 5].
We demonstrate multiple sources of performance gains. The dynamic avoidance of unnecessary serialization allows more concurrency and improves scalability. In other cases, we reduce the cost of uncontended synchronization operations, and achieve performance gains even in single thread executions. Much of the gain is achieved with changes just in the synchronization library; in some cases, localized changes in the application code result in additional gains.
Section 2 presents a brief overview of Intel TSX. We describe the experimental setup in Section 3 and outline how we apply Intel TSX to the various workloads. Section 4 characterizes Intel TSX using a suite of benchmarks. We evaluate Intel TSX for these benchmarks without any source code changes. In Section 5, we evaluate Intel TSX performance on a set of real-world workloads. We also demonstrate two key techniques to further improve performance: lockset elision and transactional coarsening. These techniques are useful if one can modify the source code to optimize for performance. In Section 6, we apply Intel TSX to a large-scale software system using a user-level TCP/IP stack and identify some of the challenges, such as condition variables.
We discuss related work in Section 7 and conclude in Section 8.

2. INTEL® TRANSACTIONAL SYNCHRONIZATION EXTENSIONS
Intel TSX provides developers an instruction set interface to specify critical sections for transactional execution.¹ The hardware executes these developer-specified critical sections transactionally, and without explicit synchronization and serialization. If the transactional execution completes successfully (transactional commit), then memory operations performed during the transactional execution appear to have occurred instantaneously, when viewed from other processors. However, if the processor cannot complete its transactional execution successfully (transactional abort), then the processor discards all transactional updates, restores architectural state, and resumes execution. The execution may then need to serialize through locking if necessary, to ensure forward progress. The mechanisms to track transactional states, detect data conflicts, and commit atomically or roll back transactional states are all implemented in hardware.
Intel TSX provides two software interfaces to specify critical sections. The Hardware Lock Elision (HLE) interface is a legacy-compatible instruction set extension (XACQUIRE and XRELEASE prefixes) for programmers who would like to run HLE-enabled software on legacy hardware, but would also like to take advantage of the new transactional execution capabilities on hardware with Intel TSX support. Restricted Transactional Memory (RTM) is a new instruction set extension (comprising the XBEGIN and XEND instructions) for programmers who prefer a more flexible interface than HLE.
¹Full specifications for Intel TSX can be found in [12]. Enabling and optimization guidelines can also be found in [13]. Additional resources for Intel TSX can be found at http://www.intel.com/software/tsx.
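To make the RTM interface concrete, here is a minimal lock-elision sketch in C. The helper names, the spinlock, and the CPUID guard are our own illustration, not part of Intel TSX or of any particular synchronization library; the guard lets the code fall back to real locking on hardware without TSX:

```c
#include <cpuid.h>
#include <immintrin.h>
#include <stdatomic.h>
#include <stdbool.h>

#define ELIDE_RETRIES 5        /* retries before falling back to the lock */

static atomic_int the_lock;    /* 0 = free, 1 = held */

static bool rtm_supported(void) {
    unsigned a, b, c, d;
    if (!__get_cpuid_count(7, 0, &a, &b, &c, &d)) return false;
    return (b >> 11) & 1;      /* CPUID.(EAX=7,ECX=0):EBX.RTM[bit 11] */
}

static void lock_acquire(atomic_int *l) {
    while (atomic_exchange_explicit(l, 1, memory_order_acquire))
        ;                      /* plain spin: fine for a sketch */
}

static void lock_release(atomic_int *l) {
    atomic_store_explicit(l, 0, memory_order_release);
}

/* Start a transaction and confirm the lock is free. Reading the lock word
 * also puts it in our read set, so a real acquire by another thread
 * aborts us, ensuring correct interaction with the fallback path. */
__attribute__((target("rtm")))
static bool try_elide(atomic_int *l) {
    for (int i = 0; i < ELIDE_RETRIES; i++) {
        if (_xbegin() == _XBEGIN_STARTED) {
            if (atomic_load_explicit(l, memory_order_relaxed) == 0)
                return true;   /* run the critical section transactionally */
            _xabort(0xff);     /* lock already held: abort, then retry */
        }
    }
    return false;              /* give up on elision */
}

void elided_acquire(atomic_int *l) {
    if (rtm_supported() && try_elide(l))
        return;                /* executing transactionally, lock untouched */
    lock_acquire(l);           /* fallback guarantees forward progress */
}

__attribute__((target("rtm")))
void elided_release(atomic_int *l) {
    if (atomic_load_explicit(l, memory_order_relaxed) == 0)
        _xend();               /* we elided the lock: commit */
    else
        lock_release(l);       /* we really held the lock */
}

/* Single-threaded demo: increment a counter n times under the elided lock. */
static long counter;
long elision_demo(int n) {
    for (int i = 0; i < n; i++) {
        elided_acquire(&the_lock);
        counter++;
        elided_release(&the_lock);
    }
    return counter;
}
```

Note that the code is correct whether or not elision succeeds: if every transactional attempt aborts, the fallback simply takes the lock.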
When an RTM region aborts, architectural state is recovered, and execution restarts non-transactionally at the fallback address provided with the XBEGIN instruction.
Intel TSX does not guarantee that a transactional execution will eventually commit. Numerous architectural and microarchitectural conditions can cause aborts. Examples include data conflicts, exceeding buffering capacity for transactional states, and executing instructions that may always abort (e.g., system calls). Software using RTM instructions should not rely on the Intel TSX execution alone for forward progress. The fallback path not using Intel TSX support must ensure forward progress, and it must be able to run successfully without Intel TSX. Additionally, the transactional path and the fallback path must co-exist without incorrect interactions.
Software using the RTM instructions for lock elision must test the lock during the transactional execution, to ensure correct interaction with another thread that may acquire, or already has acquired, the lock non-transactionally, and should abort if the lock is not free. The software fallback handler should define a policy to retry transactional execution if the lock is not free, and to explicitly acquire the lock if necessary.
When using the Intel TSX instructions to implement lock elision, whether through the HLE or RTM interface, the changes required to enable the use of these instructions are limited to synchronization libraries, and do not require application software changes.
The first implementation of Intel TSX on the 4th Generation Core™ microarchitecture uses the first-level (L1) data cache to track transactional states. All tracking and data conflict detection are done at the granularity of a cache line, using physical addresses and the cache coherence protocol. Eviction of a transactionally written line from the data cache will cause a transactional abort. However, evictions of lines that are only transactionally read do not cause an abort; they are moved into a secondary structure for tracking, and may result in an abort at some later time.

3. EXPERIMENTAL SETUP
We use an Intel 4th Generation Core™ processor with Intel TSX support. The processor has 4 cores with 2 Hyper-Threads per core, for a total of 8 threads. Each core has a 32 KB L1 data cache. We use the Intel C/C++ compiler for most of our studies, but for those applications utilizing OpenMP, we also use GCC with libgomp to precisely control the number of threads. We use inline assembly to emit bytes for Intel TSX instructions, but intrinsics are also available through compiler header files (e.g., immintrin.h).
Unless otherwise noted, we use thread affinity to bind threads to cores so that as many cores are used as possible—e.g., a 4 thread run will use a single thread on each of the 4 cores, while an 8 thread run will also use 4 cores, but with 2 threads per core. A minimum of 10 executions are averaged to derive statistically meaningful results.
The workloads we use in this paper include transactional memory benchmark suites (CLOMP-TM [23], STAMP [19], and RMS-TM [16]), real-world applications from the HPC domain, and a large-scale software system with a TCP/IP stack running network intensive applications.
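The thread-to-core binding described in the setup above can be done on Linux with sched_setaffinity; the following is a sketch (the helper names are ours, and the mapping from logical CPU numbers to cores and Hyper-Threads is system-specific):

```c
#define _GNU_SOURCE            /* for CPU_ZERO/CPU_SET and sched_setaffinity */
#include <sched.h>

/* Pin the calling thread to a single logical CPU. Returns 0 on success.
 * To spread work across cores first, a runner would pin thread t to one
 * logical CPU per core before using the second Hyper-Thread of each core. */
int pin_self_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */
}

/* Report whether the calling thread is currently allowed on a given CPU. */
int allowed_on_cpu(int cpu) {
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return 0;
    return CPU_ISSET(cpu, &set) != 0;
}
```

With OpenMP runtimes the same effect is usually achieved declaratively (e.g., libgomp's GOMP_CPU_AFFINITY environment variable) rather than with explicit calls.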

These workloads use synchronization libraries to coordinate accesses to shared data. An application may either directly call these libraries, or invoke them indirectly through macros or pragmas. These underlying libraries provide multiple mechanisms for synchronizing accesses to shared data. If the shared data being updated is a single memory location (an atomic operation), then the library can achieve this through the use of an atomic instruction (such as LOCK-prefixed instructions in the Intel 64 architecture). For more complex usages, lock-protected critical sections are used.
We apply Intel TSX to the underlying synchronization library, and do not require application source changes or annotations. Specifically, in this paper we use the RTM-based interface to elide the relevant critical section locks specified by the synchronization library, and execute the critical section transactionally. If the transactional execution is unsuccessful, then the lock may be explicitly acquired to ensure forward progress. The decision to acquire the lock explicitly is based on the number of times the transactional execution has been tried but failed; for our hardware and workloads, 5 retries gave the best overall performance. To ensure correct interaction of the transactional execution with other threads that may acquire, or already have acquired, the lock, the state of the lock is tested during the transactional execution.

4. EVALUATION ON TRANSACTIONAL MEMORY BENCHMARKS
We start by using the CLOMP-TM [23] microbenchmark to characterize Intel TSX performance, and then use the STAMP benchmark suite [19] to see how such performance translates to workload performance. We also apply Intel TSX to RMS-TM [16], and observe how it compares to fine-grained locking, and how it interacts with system calls during a transactional execution.
These transactional memory (TM) benchmark suites use macros and pragmas to invoke the underlying TM library. In addition to a TM implementation, the library also provides a lock-based critical section implementation, equivalent to a conventional lock-based execution model using a global lock. We apply Intel TSX to elide the global lock in the critical section implementation.

4.1 CLOMP-TM Results
In this section we characterize Intel TSX performance using the CLOMP-TM benchmark, version 1.6 [23]. CLOMP-TM is a synthetic memory access generator that emulates the synchronization characteristics of HPC applications; an unstructured mesh is divided into partitions, where each partition is subdivided into zones. Threads concurrently modify these zones to update the mesh.
Specifically, each zone is pre-wired to deposit a value to a set of other zones, scatter zones, which involves (1) reading the coordinate of a scatter zone, (2) doing some computation, and (3) depositing the new value back to the scatter zone. Since threads may be updating the same zone, value deposits need to be synchronized. Conflict probability can be adjusted by controlling how the zones are wired; and by changing the number of scatters per zone, the amount of work done in a critical section can be adjusted.
To compare Intel TSX performance against existing synchronization methods, we use the benchmark to reproduce the experiment conducted in [23]. Here, threads do not contend for memory locations, and to avoid artifacts from L1 data cache sharing among threads, we disable Hyper-Threading (i.e., we use 4 threads).

[Figure 1: CLOMP-TM benchmark results for 4 threads. X-axis: number of scatters per zone (1, 11, 21, 31, 41); Y-axis: speedup over serial execution; one curve each for Small TM, Large TM, Small Critical, Large Critical, and Small Atomic. The Intel TSX version (Large TM) outperforms the atomic instruction-based version (Small Atomic) when at least 3 or 4 scatter zone updates are batched.]

Figure 1 shows the results. In the figure, Small Atomic denotes the case where a LOCK-prefixed instruction is used to enforce atomicity on a single scatter zone value update; this is equivalent to using #pragma omp atomic. Likewise, Small Critical denotes the use of a lock, equivalent to #pragma omp critical, for each scatter zone update. Large Critical denotes the case where, for each zone, we batch the scatter zone updates (and the accompanying index and value computation code) under a critical section guarded by a single lock. Small TM and Large TM map the lock-guarded critical sections in Small Critical and Large Critical into calls into the Intel TSX-enabled synchronization library. The X-axis denotes the number of scatters for each zone, and at each scatter count, the speedup is against the execution time of the corresponding serial version.
When we synchronize on each scatter zone update, while the LOCK prefix-based version (Small Atomic) is the fastest, the Intel TSX version (Small TM) is not too much worse. The version that uses a lock (Small Critical), however, performs much worse. In contrast, batching a set of scatter zone updates into a single critical section allows better amortization of the synchronization costs. In particular, Intel TSX with batching (Large TM) outperforms even Small Atomic once we batch at least 3 to 4 updates. Batching with a lock (Large Critical), however, suffers from lock contention, and remains slow.
Compared to the results presented in [23], which requires 5 to 10 updates to be batched before its transactional execution outperforms atomic updates, Intel TSX exhibits lower overhead. However, the scale at which the transactional execution in [23] is implemented is different (16 cores per chip, 4 threads per core). Therefore, a direct comparison cannot be made.

4.2 STAMP Results
STAMP [19] is a benchmark suite extensively used by the transactional memory community. Compared to CLOMP-TM, its workloads are much closer to a realistic application. We use the benchmark suite (version 0.9.10) to see how Intel TSX performance translates into application performance.

[Figure 2: STAMP benchmark results for genome, intruder, kmeans, labyrinth, bayes, ssca2, vacation, and yada at 1, 2, 4, and 8 threads, plus the average. Y-axis: execution time normalized to the 1-thread sgl run; bars for sgl, tl2, and tsx. Intel TSX provides low single thread overhead, while outperforming a software transactional memory (TL2 [7]) in many cases.]

[Table 1: Transactional abort rates (%) for the STAMP benchmark suite, for tl2 and tsx at 1, 2, 4, and 8 threads. Figures of particular interest are highlighted.]

Specifically, some STAMP workloads use critical sections with medium/large memory footprint. Memory accesses within a critical section that are required for synchronization correctness have been manually annotated for use by software transactional memory (STM) implementations. STMs rely on instrumenting memory accesses within a transactional region to track transactional reads and writes, and such annotation allows STMs to only track necessary accesses. When using a lock-based execution, these annotated accesses get mapped to regular loads and stores, and are synchronized using the underlying locking mechanism.
Figure 2 shows the execution time of different synchronization schemes implemented by the underlying TM library. We use the native input with the high contention configuration. The execution time in the figure is normalized to the single thread execution time of the sgl version. sgl represents the case where the TM library implements transactional regions as critical sections protected through a single global lock. This scheme forces all transactional regions to serialize, and thus prevents scaling if critical sections comprise a significant fraction of an application's execution. As expected, with increasing thread count, workloads do not scale.
tl2 represents the performance where the TM library implements transactional regions using the STM included in the benchmark distribution, called TL2 [7]. Overall, by leveraging the annotations to only track crucial memory accesses, STM provides good scalability. However, except for labyrinth, it suffers significant single thread overhead. This is because STM has to instrument the annotated memory accesses within a transactional region. On a single-threaded execution, it still pays this overhead, but cannot exploit concurrency to make up for the performance loss.
Intel TSX, however, does not require any instrumentation. In the figure, tsx represents the performance where we apply Intel TSX to transactionally elide the single global lock in sgl. As can be seen, the Intel TSX-enhanced library shows radically improved single thread performance compared to STM; specifically, the performance is comparable to the single global lock. With more threads, however, Intel TSX scales significantly better than the single global lock, and in many cases, outperforms STM. With both good single-thread performance and good scalability, a programmer may elect to apply Intel TSX over coarse-grained locks, instead of undertaking the conversion effort to fine-grained locks or suffering the high overheads of STM.
Although we provide results on all the workloads for completeness, results on bayes and kmeans should be discounted, because their execution is strongly dependent on the order of various parallel computations—thus, a slower synchronization scheme may result in faster benchmark execution, and vice versa. Specifically, bayes utilizes a hill-climbing strategy that combines local and global search [19]. We notice that executions with STM consistently get stuck in local minima, terminating the search earlier but returning inferior results. Similarly, kmeans iterates its algorithm until the cluster search converges; we notice that an implementation using Intel TSX always converges faster than STM. We suspect both cases are related to how this specific STM implementation handles floating point variables, and are currently investigating the issue.
Table 1 shows transactional abort rates, which give more insight into TL2 and Intel TSX behavior. We collect Intel TSX statistics through Linux perf. The first thing to note is the non-trivial abort rate of Intel TSX with only one thread. These aborts are mostly due to the effective capacity limit of the set-associative L1 data cache for medium/large critical sections. Hyper-Threading, on the other hand, increases the pressure on the L1, compounding the capacity issue. Thus, in the table, Intel TSX sees significantly higher transactional abort rates with 8 threads than with 4 threads.
Overall, while STAMP tries to cover diverse transactional characteristics, we see that some workloads stopped critical section refinement at a medium/large footprint; this would not have been a problem for STMs with virtually unlimited buffer size. STMs also manage to avoid capacity issues through their heavy use of selective annotation (e.g., for labyrinth, a 14 MB copy of a global structure to thread-local memory is not annotated). Such manual annotation requires significant effort, especially in a large-scale software system [10], and is not possible with high-level transactional programming constructs [27]. However, due to its low overhead, Intel TSX provides speedup over STM in many cases where its capacity-induced abort rate is reasonable.

4.3 RMS-TM Results
The STAMP benchmark suite is written from the ground up specifically to evaluate transactional memory implementations. In contrast, RMS-TM [16] adapts a set of existing workloads to use transactional memory. As a result,

workloads in RMS-TM exhibit different characteristics from STAMP. Specifically, compared to the medium/large transactions used by STAMP, RMS-TM utilizes fine-grained locks. Therefore, the critical sections exhibit moderate footprint, and as in high-level transactional programming languages [27], no manual annotation is performed. On the other hand, the workloads perform (non-transactional) memory allocation and I/O within critical sections.
We use the RMS-TM benchmark suite to observe how Intel TSX-based synchronization fares in scenarios that are (1) already optimized (through fine-grained locks) or (2) not always friendly to transactional execution (i.e., memory allocation and I/O within critical sections). Specifically, we disable the TM-MEM and TM-FILE flags to perform native memory management and file operations within transactional regions, and use the larger input set provided by the benchmark.

[Figure 3: RMS-TM benchmark results for apriori, fluidanimate, hmm-calibrate, hmm-pfam, hmm-search, scalparc, and utilitymine at 1, 2, 4, and 8 threads, plus the average. Y-axis: speedup over 1-thread fine-grained locking; bars for fgl, tsx, and sgl. Intel TSX provides comparable performance to fine-grained locking, even when system calls are made during a transactional execution.]

[Table 2: Real-world workloads used in this study: min-cut graph clustering (Kernel 4 of SSCA2 [1]); Unstructured Adaptive (UA) from the NAS Parallel Benchmarks suite [9], which solves a set of heat equations on an adaptive mesh; a physics solver that uses PSOR to solve a set of 3-D force constraints on groups of blocks; non-uniform FFT (baseline reported in [15]); parallel image histogram construction; and a VLSI router from PARSEC [2] that performs simulated annealing. Sync denotes the synchronization mechanism used by the original code (Pthread locks, atomics, or lock-free). Txn Technique represents the transactional optimization techniques we apply (Lockset = Lockset Elision, StatC = Static Coarsening, DynC = Dynamic Coarsening).]
Figure 3 shows the results. We compare the speedup of Intel TSX (tsx) to fine-grained locking (fgl), relative to fine-grained locking with a single thread. With fine-grained locking, RMS-TM workloads scale reasonably well. Using Intel TSX provides comparable performance, demonstrating that memory allocation and I/O within a transactional region do not require special handling, nor necessarily impact performance to a significant degree. As long as such a condition is detected early and the lock is acquired, system calls may not be a performance issue. We also observe that Hyper-Threading has less performance impact on Intel TSX, primarily because the data footprints are moderate as compared to some STAMP workloads.
Figure 3 also shows the performance when we use a single global lock (sgl) to synchronize all critical sections. Here, macros that mark critical sections are mapped to acquire and release a single global lock, instead. Therefore, the code section that is being synchronized is the same as Intel TSX. Guarding the critical sections with fine-grained locks or a single global lock does not make significant performance differences, except in fluidanimate, with lots of small critical sections, and utilitymine, with more than 30% of execution spent in critical sections [16]. Here, the single global lock fails to scale, while Intel TSX effectively exploits the parallelism, providing comparable performance to fine-grained locking.

5. EVALUATION ON REAL-WORLD WORKLOADS
In this section, we apply and evaluate Intel TSX on a set of real-world workloads. These applications use different types of synchronization mechanisms: lock-based critical sections, atomic operations, and lock-free data structures. Applying Intel TSX to the lock-based critical sections is straightforward.
However, we modified the source code so that we could also apply Intel TSX to code regions that use atomic operations and lock-free data structures.
For each workload, we start with a straightforward translation, and then consider optimizations to improve the performance of transactional synchronization.

5.1 Workloads
Table 2 shows the workloads we use for this study. These workloads cover various threading and synchronization schemes, and some represent computations typically found in the HPC domain. In fact, physicsSolver and histogram were used to stress test a throughput-oriented processor [24].
Specifically, graphCluster is Kernel 4 of the SSCA2 benchmark [1]. The ssca2 workload in STAMP, in contrast, re-implements Kernel 1 for transactional memory from the ground up. graphCluster partitions a graph into clusters while minimizing edge cut costs. Vertices are observed in parallel, and based on the neighbors, they may be added to or removed from the cluster. The original application uses per-vertex locks to synchronize updates on the vertex status.
ua is the Unstructured Adaptive workload from the NAS Parallel Benchmarks suite [9]. To handle the adaptively refined mesh, ua utilizes the Mortar Element Method [9], where thread-local computations performed on collocation points are dynamically gathered (i.e., reduced) to mortars on a global grid. Since the grid dynamically changes, gathers on each mortar require synchronization—the original application uses atomic operations. Reduced values are later scattered back to collocation points.
physicsSolver iteratively resolves constraints between pairs of objects, computing the force exerted on each other to prevent inter-penetration. A key critical section updates the total force exerted on both objects in a given pair. Since each object may be involved in multiple pair-wise interactions, the original application acquires a pair of locks to resolve each constraint, one lock for each object.
nufft performs 3-D non-uniform FFT. We use the baseline version reported in [15]. Specifically, we focus on the adjoint NUFFT operator, which reduces a set of non-uniformly spaced spectral indices onto a uniform spectral grid. Since the reduction combines an unpredictable set of non-uniform indices for each grid point, it requires synchronization. The original application uses an array of locks for this.
histogram is an image histogram construction workload. Multiple threads directly update the shared histogram; thus, the updates require synchronization. The original application uses an atomic operation for each bin update. While simple, histogram comprises the core compute of many HPC workloads, such as the two-point correlation function in astrophysics [3], and radix sort [17].
Lastly, canneal is a routing workload from PARSEC [2]. It performs simulated annealing, where each thread tries to randomly swap two elements to improve solution quality. To perform this swap in an atomic fashion, the original application implements a lock-free scheme.

[Figure 4: Intel TSX performance on real-world workloads. Speedup over the baseline with 1 thread, at 1, 2, 4, and 8 threads per workload, plus the average.]
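The pair-of-locks pattern in physicsSolver is the motivating case for lockset elision: a single transaction can elide both locks of a pair at once. The following is a hedged sketch, not the paper's implementation; helper names are ours, the CPUID guard makes the code fall back to really acquiring both locks (in a fixed order, to avoid deadlock) when RTM is unavailable or keeps aborting, and the demo assumes i != j:

```c
#include <cpuid.h>
#include <immintrin.h>
#include <stdatomic.h>
#include <stdbool.h>

#define LS_RETRIES 5

static bool ls_rtm_ok(void) {
    unsigned a, b, c, d;
    if (!__get_cpuid_count(7, 0, &a, &b, &c, &d)) return false;
    return (b >> 11) & 1;               /* CPUID.(EAX=7,ECX=0):EBX.RTM[11] */
}

static void ls_lock(atomic_int *l) {
    while (atomic_exchange_explicit(l, 1, memory_order_acquire))
        ;
}
static void ls_unlock(atomic_int *l) {
    atomic_store_explicit(l, 0, memory_order_release);
}

/* Try to elide BOTH locks in one transaction. Reading each lock word puts
 * it in our read set, so a non-transactional acquire by another thread
 * aborts us. */
__attribute__((target("rtm")))
static bool ls_try_elide(atomic_int *a, atomic_int *b) {
    for (int i = 0; i < LS_RETRIES; i++) {
        if (_xbegin() == _XBEGIN_STARTED) {
            if (atomic_load_explicit(a, memory_order_relaxed) == 0 &&
                atomic_load_explicit(b, memory_order_relaxed) == 0)
                return true;
            _xabort(0xff);              /* one of the locks is held */
        }
    }
    return false;
}

__attribute__((target("rtm")))
static void ls_end(void) { _xend(); }

/* Hypothetical per-object state: one lock and one force accumulator each. */
static atomic_int obj_lock[2];
static double     obj_force[2];

void apply_pair_force(int i, int j, double f) {
    atomic_int *lo = &obj_lock[i < j ? i : j];   /* fixed acquisition order */
    atomic_int *hi = &obj_lock[i < j ? j : i];   /* avoids fallback deadlock */
    bool elided = ls_rtm_ok() && ls_try_elide(lo, hi);
    if (!elided) { ls_lock(lo); ls_lock(hi); }

    obj_force[i] += f;                  /* the pair-wise critical section */
    obj_force[j] -= f;

    if (elided) ls_end();
    else        { ls_unlock(hi); ls_unlock(lo); }
}
```

When elision succeeds, neither lock is written, so constraints touching disjoint objects proceed fully in parallel; the fallback preserves the original two-lock semantics.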

