New Performance Updates


Virginia POWER User Group
May 19, 2015

What's New: Performance Features for IBM PowerVM & POWER8
Steve Nasypany
[email protected]

Copyright IBM Corporation 2015. Technical University/Symposia materials may not be reproduced in whole or in part without the prior written permission of IBM.


General Performance News


Optimization Redbook

  • Draft available now!
  • POWER7 & POWER8
  • PowerVM Hypervisor
  • AIX, i & Linux
  • Java, WAS, DB2
  • Compilers & optimization
  • Performance tools & tuning

  • Quick View of POWER8
  • POWER8 Migration & Best Practices
  • SAP, Oracle, Siebel results linked here
  • IBM Power Systems Performance Report

      • POWER8 Single-Thread, SMT2, SMT4 & SMT8 numbers!
      • Per report:
          • Uplift from SMT2 to SMT4 is 30%
          • Uplift from SMT4 to SMT8 is 7%
          • Uplift from Single-Thread to SMT8 is 100%
      • Per-SMT-thread performance vs throughput should be very linear, as threads are more equally biased in POWER8 (covered later)


Dynamic System Optimizer

  • The Dynamic System Optimizer function in AIX is not supported on POWER8
      • AIX function, formerly called Active System Optimizer (aso); daemon function available for free
      • Additional charged features for:
          • Autonomic Large Page (16 MB) Migration
          • Autonomic Processor Pre-Fetch Control
  • AIX asoo commands will not execute anything on POWER8
      • If you migrate from POWER7 with it enabled, it will remain enabled, but the aso daemon will not do anything
      • No performance concern, but you can disable it if you find the aso logs annoying
  • Future support is based on two issues
      • Benefit of DSO was not judged a high priority for Scale Out systems
      • Functional support is not as much a technical issue as a testing resources issue
      • Lab is interested in feedback on customers who want Scale Up support for POWER8. Complain to CTS or me and I will forward to development
  • This has no impact on Dynamic Platform Optimizer
      • DSO optimizes threads within a virtual machine (OS) instance
      • DPO optimizes virtual machine placement within a frame


Java

  • Java 7.1 SR1 is the preferred level for POWER7 and POWER8
  • Java 6 SR7 is the minimally recommended level for POWER7, as it contains optimizations for POWER7 and the default use of 64KB (versus 4KB) pages for Java Virtual Machines (JVM) in AIX
  • Java 7.1 is optimized to use specific hardware optimizations for POWER8
  • JIT compiler will automatically detect platform architecture and generate code optimized for that platform
  • WAS RHEL 6, SLES 11 Linux support use of 64KB pages for JVMs
  • As with all legacy levels, Java applications with little memory footprint typically perform better in 32-bit. Applications with larger memory requirements should use 64-bit.
  • A variety of other Java optimizations for AIX & Linux are covered in Section 8.3 of the Performance Optimization & Tuning Techniques for IBM Processors, including IBM POWER8 Redbook


Utilization, Simultaneous Multithreading & Virtual Processors


Review: POWER6 vs POWER7/8 SMT Utilization

POWER5/6 utilization does not account for SMT; POWER7/8 is calibrated in hardware

[Figure: one busy hardware thread (Htc0), all other hardware threads idle; busy = user% + system%]

  Architecture  SMT mode  Hardware threads             Reported busy
  POWER6        SMT2      Htc0 busy, Htc1 idle         100%
  POWER7        SMT2      Htc0 busy, Htc1 idle         ~70%
  POWER7        SMT4      Htc0 busy, Htc1-Htc3 idle    ~63%
  POWER8        SMT4      Htc0 busy, Htc1-Htc3 idle    ~60%
  POWER8        SMT8      Htc0 busy, Htc1-Htc7 idle    ~56%

  • With a simulated single-threaded process on 1 core and 1 Virtual Processor, the utilization values change. In each case, physical consumption can be reported as 1.0.
  • Real-world production workloads will involve dozens to thousands of threads, so users may not notice any difference at the macro scale
  • See Simultaneous Multi-Threading on POWER7 Processors by Mark Funk


POWER6 vs POWER7/POWER8 Dispatch

[Figure: point at which another Virtual Processor is activated]

  POWER6 SMT2:    Htc0 busy, Htc1 busy; ~80% busy
  POWER7/8 SMT4:  Htc0 busy, Htc1-Htc3 idle; ~50% busy

  • Workloads are distributed across cores differently in POWER7 & POWER8 than in earlier architectures
  • In POWER5 & POWER6, the primary and secondary SMT threads are loaded to ~80% utilization before another Virtual Processor is unfolded
  • In POWER7, all of the primary threads (defined by how many VPs are available) are loaded to at least ~50% utilization before the secondary threads are used. Once the secondary threads are loaded, only then will the tertiary threads be dispatched. This is referred to as Raw Throughput mode.
  • Why? Raw Throughput provides the highest per-thread throughput and best response times at the expense of activating more physical cores
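
As a rough illustration of the SMT utilization review in this section, the Python sketch below tabulates the approximate busy percentage reported when only one hardware thread on a core is busy, and the headroom that figure implies. The percentages are the approximations quoted on the slide. The names REPORTED_BUSY and headroom are hypothetical and exist only for this example.

```python
# Approximate utilization reported with ONE busy hardware thread on one core.
# Values are the approximations quoted in the SMT utilization review.
REPORTED_BUSY = {
    ("POWER6", "SMT2"): 100,
    ("POWER7", "SMT2"): 70,
    ("POWER7", "SMT4"): 63,
    ("POWER8", "SMT4"): 60,
    ("POWER8", "SMT8"): 56,
}


def headroom(arch: str, smt: str) -> int:
    """Percent of the core still reported as available with one thread busy."""
    return 100 - REPORTED_BUSY[(arch, smt)]


if __name__ == "__main__":
    for (arch, smt), busy in REPORTED_BUSY.items():
        print(f"{arch} {smt}: ~{busy}% busy, ~{headroom(arch, smt)}% headroom")
```

The point of the table is that POWER6 reports a core as fully busy with one thread running, while POWER7 and POWER8 report the capacity that the idle hardware threads could still deliver.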

Review: POWER6 vs POWER7/8 Dispatch

[Figure: dispatch order across proc0-proc3 (lcpu 0-3, 4-7, 8-11, 12-15). POWER6: primary, then secondary threads. POWER7/POWER8 (Raw Mode): primary, then secondary, then tertiary threads. Each proc is labeled 63%, 77%, 88%, 100%.]

  • Once a Virtual Processor is dispatched, the Physical Consumption metric will typically increase to the next whole number
  • Put another way, the more Virtual Processors you assign, the higher your Physical Consumption is likely to be in POWER7/POWER8


POWER7/POWER8 Consumption

  • POWER7/POWER8 will activate more cores at lower utilization levels than earlier architectures when excess VPs are present
  • Customers may complain that the physical consumption metric (reported as physc or pc) is equal to or possibly even higher after migrations from earlier architectures
  • Expect every POWER7/POWER8 customer with this complaint to also have significantly higher idle% percentages than on earlier architectures
  • Consolidation of workloads may result in many more VPs assigned to a new POWER7 or POWER8 partition
  • Just because we let you set very high ranges of Virtual Processor to Entitlement (20:1 now on some POWER7+ and POWER8) does not mean that is always optimal. Your choices have consequences. There is no magic ratio for all environments. If you want more education on VP vs Entitlement, ask for that education.
  • More VPs can result in lower affinity
      • Broader spread across shared pool and memory domains
      • Lower affinity leads to more cycles, more cycles leads to lower perf


Virtual Processor Dispatch

  • A recurring question in AIX is "how many Virtual Processors am I using?"
      • The physical consumption metric (physc or pc) could be used to approximate activity if the VP Folding algorithm was understood and the workload was stable (typically, 1 to 2 VPs higher than physc)
      • Tools like sar, mpstat and nmon could be used to display logical CPUs and divine how many Virtual Processors were active by looking at SMT sets (mapping to a VP) and their logical CPU statistics (utilization and context switches)
  • A new mpstat option provides information on Virtual Processor activity
      • mpstat -v
      • Displays the delta Virtual Timebase (VTB), which is time charged to a dispatched VP
      • If the Virtual Timebase is 0, the processor statistics associated with that VP will not be shown, simplifying the output
      • AIX 7.1 TL3 SP2


Virtual Processors Dispatched - mpstat -v

  vcpu  lcpu     us    sy    wa     id  pbusy          pc             VTB(ms)
  ----  ----  -----  ----  ----  -----  -------------  -------------  -------
     0        55.88  0.53  0.00  43.59  0.34[ 56.4%]   0.60[119.7%]       649
           0  55.88  0.52  0.00   0.47  0.34[ 56.4%]   0.34[ 56.9%]         -
           1   0.00  0.00  0.00  13.95  0.00[  0.0%]   0.08[ 13.9%]         -
           2   0.00  0.00  0.00  15.04  0.00[  0.0%]   0.09[ 15.0%]         -
           3   0.00  0.01  0.00  14.13  0.00[  0.0%]   0.08[ 14.1%]         -
     4        56.26  0.92  0.00  42.82  0.07[ 57.2%]   0.13[ 25.5%]       209
           4  56.26  0.87  0.00   1.28  0.07[ 57.1%]   0.07[ 58.4%]         -
           5   0.00  0.04  0.00  14.11  0.00[  0.0%]   0.02[ 14.1%]         -
           6   0.00  0.01  0.00  13.69  0.00[  0.0%]   0.02[ 14.8%]         -
           7   0.00  0.01  0.00  13.75  0.00[  0.0%]   0.02[ 13.9%]         -
     8        60.92  0.50  0.00  38.58  0.15[ 61.4%]   0.25[ 49.0%]       404

VCPU values appear to be tied to the lowest logical CPU number. In this example there are only 3 active VPs, and VCPU does not represent some internal AIX numbering scheme.


Migration Guidance


Migrations: Dispatching, SMT

  • Will I have a problem?
      • If you are migrating between POWER7 and POWER8: not a problem
          • AIX SMT4 default makes these migrations apples-to-apples
          • Default dispatcher behaves the same
      • If you are migrating from POWER5/POWER6 to POWER8: maybe a problem
          • POWER7 & POWER8 behave the same way
          • Now that you understand the dispatch behavior, you know why customers may complain
  • What are my options?
      • Get the VP counts right the first time. Do not do 1:1 VP sizings for larger partitions between POWER5/6 and POWER7/8. This will get you into trouble!
      • If a customer ignores updated VP sizings, consider using Scaled Throughput tunings
      • Use Scaled Throughput tunings: AIX uses more SMT threads before dispatching a VP. See backup material for detail and guidance.
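
When reviewing VP sizings for a migration, the rule of thumb from the Virtual Processor Dispatch slide (active VPs are typically 1 to 2 higher than physc) can be sketched as below. This is only an approximation for stable workloads. Rounding physc up to a whole number and capping the result at the assigned VP count are assumptions made for this example, and the function name estimate_active_vps is hypothetical. Where mpstat -v is available, it reports dispatched VPs directly.

```python
import math
from typing import Tuple


def estimate_active_vps(physc: float, assigned_vps: int) -> Tuple[int, int]:
    """Return a (low, high) estimate of active Virtual Processors.

    Rule of thumb from the text: typically 1 to 2 VPs higher than physc.
    Assumptions for this sketch: physc is rounded up to a whole number, and
    the estimate can never exceed the number of VPs assigned to the partition.
    """
    base = math.ceil(physc)
    low = min(base + 1, assigned_vps)
    high = min(base + 2, assigned_vps)
    return low, high


if __name__ == "__main__":
    # A partition consuming 3.4 physical cores with 16 VPs assigned.
    print(estimate_active_vps(3.4, 16))
```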

POWER8 SMT Default: Why SMT4?

  • AIX 6.1 will only support SMT4. Most customers are still running AIX 6.1
  • After early experiences with POWER7, AIX chose the conservative path for POWER8 at the expense of some capacity
      • Most workloads will be fine with SMT4 or SMT8
      • All those problems you thought were SMT issues in POWER7 weren't. They were firmware, affinity, aggressive dispatcher, too many VPs.
      • We avoid application scalability issues made visible by more SMT threads, but often blamed incorrectly on SMT
  • Lab view is most customers do not run at utilization levels (> 80%) to benefit from SMT8. The reality is, many, if not most, of our customers do not run at utilization levels to fully exercise SMT4.
  • SMT4 is the best of all worlds for now, but there are now more options to exploit SMT. This can be done via the Scaled Throughput tunings, which are covered in the backup material


POWER8 SMT: Should I use SMT8?

  • Any PoC or benchmark where you're going to drive to 80% utilization
      • Absolutely try SMT8, don't leave capacity on the table
      • You can't get to the highest rPerf without SMT8
      • OLTP DB, large WAS appservers, etc. have seen 5 to 15% increases
  • We should be open to letting experienced customers try SMT8
      • These customers typically know what they're doing and understand if higher SMT is appropriate for their environment
      • It is easy and free to test SMT4 and SMT8 modes, no reboot
  • For new customers/applications, need to review software stack
      • If the application space is well known on AIX, it should not be a problem
      • If the application is new to AIX or Linux, it should be tested for scaling issues (product may have never been tested to 24 cores / 192 logical CPUs)
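
To put numbers on the SMT4 versus SMT8 trade-off, the sketch below chains the uplift percentages quoted from the performance report earlier in this deck (SMT2 to SMT4 30%, SMT4 to SMT8 7%, Single-Thread to SMT8 100%) to get per-core throughput relative to single-thread. It assumes the uplifts compose multiplicatively for one workload, which the report does not guarantee. The function names are hypothetical. It also shows the logical CPU arithmetic behind the "24 cores / 192 logical CPUs" remark.

```python
def relative_throughput() -> dict:
    """Per-core throughput relative to single-thread (ST = 1.0).

    Derived by working backwards from SMT8 using the quoted uplifts.
    Assumes the uplifts compose multiplicatively for the same workload.
    """
    smt8 = 1.0 * (1 + 1.00)   # Single-Thread -> SMT8 uplift of 100%
    smt4 = smt8 / (1 + 0.07)  # SMT4 -> SMT8 uplift of 7%
    smt2 = smt4 / (1 + 0.30)  # SMT2 -> SMT4 uplift of 30%
    return {"ST": 1.0, "SMT2": smt2, "SMT4": smt4, "SMT8": smt8}


def logical_cpus(cores: int, smt_threads: int) -> int:
    """Logical CPUs presented to the OS for a given core count and SMT mode."""
    return cores * smt_threads


if __name__ == "__main__":
    for mode, value in relative_throughput().items():
        print(f"{mode}: {value:.2f}x single-thread")
    print(logical_cpus(24, 8))  # 24 cores in SMT8 mode -> 192 logical CPUs
```

Under these assumptions most of the gain is already present at SMT4, which is consistent with the deck's view that SMT8 mainly pays off at high utilization.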

POWER8 SMT: Flexible SMT

  • POWER7 & POWER8 are different in SMT bias
      • In POWER7, there is a correlation between the Hardware Thread number (logical CPU 0, 1, 2 & 3) and physical resources within the processor. Lower threads may also have a higher priority.
      • POWER8 Hardware Threads are equally biased and provide the same performance regardless of which thread is active. This is true for AIX & Linux.
      • For AIX, you do not need to worry about using bindprocessor or RSET function with various threads, or always pinning to a Virtual Processor's Primary Hardware Thread for the best performance.
      • This topic, called Flexible SMT, is covered in more detail in Section 4.2 of the tuning Redbook
  • AIX will dynamically adjust between SMT and ST mode based on the workload utilization. A 1:1 equivalent in Linux does not really exist, but I expect similar function will migrate to Linux and/or PowerKVM
      local_near_far_memory_part_4_aggressive_intelligent_threads46?lang=en


POWER8 SMT Opinion: What about Linux?

  • The Linux space is a bit more complicated
      • As of right now, there does not appear to be seamless handling of SMT between all Linux distros and PowerKVM comparable to PowerVM hosting AIX and i.
      • Most Linux workloads are more scale out than scale up
          • Smaller partitions
          • More HPC-like, manual SMT tunings, manual bindings to processors
  • IBM and the industry are working on this
      • SMT can be dynamically changed
      • Distros have added more SMT awareness, NUMA tooling (numastat)
      • Visibility of SMT through host & client layers may differ in distros
      • Split-Core function offered, where a single core with SMT8 will be split into four SMT2 cores from the guest perspective
  • Rely on guidance provided by the Linux OS and application space. LTC is very responsive to DeveloperWorks Community questions:

Migrating Memory & Storage I/O

  • If your environment has been memory constrained, consider profiling existing workloads for Active Memory Expansion
      • We are getting many field questions about this feature in 2015
      • AIX amepat tool can profile running workloads
      • Generates output report with guidance on recommended expansion factors and CPU use required to implement
      • Can select target architecture of POWER7 or POWER8
      • Supported on AIX 6.1 with POWER6 and above
  • For storage I/O, use existing tools and knowledge base for planning
      • Ask for a Disk Magic study
      • Use documents/tools at IBM Techdocs
      • Search for documents on POWER8 or written by Dan Braden, Sue Baker, John Hock
      • For example, the updated Fibre Channel Planning tool estimates adapters required based on IOPS, MB/sec, paths and LUN counts (should work fine for System i & AIX)


Migrating Network & General Tuning

  • For all network efforts, see Steve Knudson's and/or Alexander Paul's presentations (10 Gb SEA tuning, SR-IOV, etc.)
      • High packet counts (>100K/sec) or low-latency tiny packets require tuning
      • Learn about mtu_bypass
  • Beware using Large Receive & Send on VIOS with Linux clients
      • Linux does not support this feature (LTC is trying!)
      • Mixing AIX/i clients with Linux virtual ethernet/SEA will result in performance issues
      • Separate Linux clients
  • See the Lab's Performance Tuning Best Practices links
      • Single sheets for POWER7 & POWER8
      • Transition and Service Strategy guidance
      • All at:


Scaled Throughput


What is Scaled Throughput?

  • Scaled Throughput is an alternative to the default "Raw" AIX scheduling mechanism
      • An alternative for some customers at the cost of some performance
      • Not an alternative to addressing AIX and pHyp defects, partition placement issues, realistic entitlement settings and excessive Virtual Processor assignments
      • Will dispatch more SMT threads to a VP/core before unfolding more VPs
      • It can be considered to be more like the POWER6 folding mechanism, but this is a generalization, not a technical statement
      • Supported on POWER7/POWER7+, AIX 6.1 TL08 & AIX 7.1 TL02
      • Does not apply to dedicated partitions unless you enable VP folding
  • Raw vs Scaled Performance
      • Raw provides the highest per-thread throughput and best response times at the expense of activating more physical cores
      • Scaled provides the highest core throughput at the expense of per-thread response times and throughput. It also provides the highest system-wide throughput per VP because hardware thread capacity is not left on the table.
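
The difference between Raw and Scaled dispatch can be sketched with a toy placement model. This is a simplified illustration, not the AIX dispatcher: it ignores the utilization thresholds (which the deck notes are undocumented), treats every runnable software thread as fully occupying one hardware thread, and assumes SMT4. The function place_threads and its parameters are hypothetical names for this example.

```python
from typing import List


def place_threads(n_threads: int, n_vps: int, threads_per_vp: int,
                  smt: int = 4) -> List[int]:
    """Return the number of busy hardware threads on each VP.

    threads_per_vp is how many hardware threads are filled on a VP before
    the next VP is unfolded: 1 models Raw, 2 or 4 model Scaled behavior.
    Once every VP is unfolded, one more thread per VP is added per pass.
    """
    if not 1 <= threads_per_vp <= smt:
        raise ValueError("threads_per_vp must be between 1 and smt")
    busy = [0] * n_vps
    # Threads beyond the total hardware thread capacity cannot be placed.
    remaining = min(n_threads, n_vps * smt)
    cap = threads_per_vp
    while remaining > 0:
        for vp in range(n_vps):
            take = min(cap - busy[vp], remaining)
            busy[vp] += take
            remaining -= take
            if remaining == 0:
                break
        cap = min(cap + 1, smt)
    return busy


if __name__ == "__main__":
    for label, per_vp in (("Raw", 1), ("Scaled 2", 2), ("Scaled 4", 4)):
        layout = place_threads(8, 8, per_vp)
        cores = sum(1 for b in layout if b > 0)
        print(f"{label}: {layout} -> {cores} cores active")
```

With eight runnable threads and eight VPs, the Raw setting spreads across all eight cores, while the Scaled settings pack the same threads onto four or two cores, trading per-thread response time for lower core consumption.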

Raw vs Scaled

[Figure: dispatch order across proc0-proc3 (lcpu 0-3, 4-7, 8-11, 12-15) for Raw (default), Scaled Mode 2 and Scaled Mode 4. Threads are labeled Primary, Secondary and Tertiaries, and each proc is marked 63%, 77%, 88%, 100%.]

  • Once a Virtual Processor is dispatched, physical consumption will typically increase to the next whole number
  • POWER8 Mode + AIX 7.1 supports Scaled Mode 8


Scaled Throughput: Tuning

  • Tunings are not restricted, but you can be sure that anyone experimenting with this without understanding the mechanism may suffer significant performance impacts
      • Dynamic schedo tunable
      • Actual thresholds used by these modes are not documented and may change at any time
  • schedo -p -o vpm_throughput_mode=
      0   Legacy Raw mode (default)
      1   Scaled or "Enhanced Raw" mode with a higher threshold than legacy
      2   Scaled mode, use primary and secondary SMT threads
      4   Scaled mode, use all four SMT threads
      8   Scaled mode, use eight SMT threads (POWER8, AIX 7.1 required)
  • Tunable schedo vpm_throughput_core_threshold sets a core count at which to switch from Raw to Scaled Mode
      • Allows fine-tuning for workloads depending on utilization level
      • VPs will ramp up quicker to a desired number of cores, and then be more conservative under the chosen Scaled mode


Scaled Throughput: Guidance

  • Workloads
      • Workloads with many light-weight threads with short dispatch cycles and low I/O (the same types of workloads that benefit well from SMT)
      • Customers who are easily meeting network and I/O SLAs may find the tradeoff between higher latencies and lower core consumption attractive
      • Customers who will not reduce over-allocated VPs and prefer to see POWER6 behavior
      • Use mpstat (-v) in AIX 7.1 TL3 to view Virtual Processor dispatches
  • Performance
      • It depends; we can't guarantee what all workloads will do
      • Mode 1 may see little or no impact but higher per-core utilization with lower physical consumed (typically 10-15%)
      • Workloads that do not benefit from SMT and use Mode 2 or Mode 4 will see double-digit per-thread performance degradation (higher latency, slower completion times)
      • POWER6 workloads migrating to POWER7 or POWER8 and using Mode 2 will likely perform as well or better, and minimize complaints about higher than expected physical consumption. Many POWER7 workloads could migrate to POWER8 Mode 2 and reduce core usage without performance impact.
      • These are non-restricted dynamic tunings, easily tested like SMT mode changes


Raw Throughput: Default and Mode 1

[Figure: two charts, "Raw Throughput" and "Scaled Throughput: Mode 1". Y axis 0 to 12, X axis Time. Series: Active_Threads, Active_VP, Phys_Busy, Phys_Consumed.]

  • Raw Throughput
      • AIX will typically allocate 2 extra Virtual Processors as the workload scales up, and is more instantaneous in nature
      • VPs are activated and deactivated one second at a time
  • Scaled Throughput: Mode 1
      • Mode 1 is more of a modification to the Raw (Mode 0) throughput mode, using a higher utilization threshold and a moving average to reduce VP oscillation
      • It is less aggressive about VP activations. Many workloads may see little or no performance impact


Scaled Throughput: Modes 2 & 4

[Figure: two charts, "Scaled Throughput: Mode 2" and "Scaled Throughput: Mode 4". Y axis 0 to 12, X axis Time. Series: Active_Threads, Active_VP, Phys_Busy, Phys_Consumed.]

  • Scaled Throughput: Mode 2
      • Mode 2 utilizes both the primary and secondary SMT threads
      • Somewhat like POWER6 SMT2, eight threads are collapsed onto four cores
      • Physical Busy or utilization percentage reaches ~80% of Physical Consumption
  • Scaled Throughput: Mode 4
      • Mode 4 utilizes the primary, secondary and tertiary SMT threads
      • Eight threads are collapsed onto two cores
      • Physical Busy or utilization percentage reaches 90-100% of Physical Consumption


Tuning (other)

  • Never adjust the legacy vpm_fold_threshold without L3 Support guidance

  • Remember that Virtual Processors activate and deactivate on 1 second boundaries. The legacy schedo tunable vpm_xvcpus allows enablement of more VPs than required by the workload. This is rarely needed, and is over-ridden when Scaled Mode is active.
  • If you use RSET or bindprocessor function and bind a workload
      • To a secondary thread, that VP will always stay in at least SMT2 mode
      • If you bind to a tertiary thread, that VP cannot leave SMT4 mode
      • POWER8 threads are more balanced, whereas lower POWER7 threads typically have a higher priority
      • These functions should only be used to bind to primary threads unless you know what you are doing or are an application developer familiar with the RSET API
      • Use bindprocessor -s to list primary, secondary and tertiary threads
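
The binding rules above can be expressed as a small lookup. The sketch assumes SMT4 and assumes logical CPUs are numbered contiguously per Virtual Processor (lcpu 0-3 on the first VP, 4-7 on the second, and so on, as in the dispatch figures). Both are assumptions for illustration, and the function names are hypothetical. On a real system, use bindprocessor -s to list which logical CPUs are primary, secondary and tertiary threads.

```python
SMT_THREADS = 4  # assumption: partition is running in SMT4 mode


def hardware_thread(lcpu: int) -> int:
    """Hardware thread index within its VP (0 = primary), assuming
    logical CPUs are numbered contiguously per Virtual Processor."""
    return lcpu % SMT_THREADS


def min_smt_mode_when_bound(lcpu: int) -> str:
    """Lowest SMT mode the VP can drop to while a workload is bound to lcpu."""
    thread = hardware_thread(lcpu)
    if thread == 0:
        return "ST"    # primary thread: the VP can still run single-threaded
    if thread == 1:
        return "SMT2"  # secondary thread: the VP stays in at least SMT2
    return "SMT4"      # tertiary thread: the VP cannot leave SMT4


if __name__ == "__main__":
    for lcpu in (0, 5, 7, 10):
        print(f"lcpu {lcpu}: {min_smt_mode_when_bound(lcpu)}")
```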
