Tuning a runtime for productivity and performance

Tuning a Runtime for Both Productivity and Performance
Mei-Chin Tsai, Jared Parsons

What is a runtime?
[Diagram: MetaData/IL passes through "Translate" to each target platform: Windows/X86, Windows/X64, Linux/X86, Linux/X64, Windows/ARM, Linux/ARM, Windows/ARM64, Linux/ARM64]

  1. Tuning startup and throughput
  2. Latency case study
  3. Takeaways

Services of runtime to execute code
  • TypeSystem
      • Object layout and vtable layout
      • Type casting (correctness)
  • Just-in-Time compiler (JIT)
      • Convert IL to native code
  • Garbage Collector (GC)
      • Cleaning up managed heap when needed

    class MyBase
    {
        public int baseField;
        // ...
    }

    class MyClass : MyBase
    {
        public int myField;
        // ...
        public virtual int myFunc()
        {
            int result = myField + baseField;
            return result;
        }
    }
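The MyBase/MyClass slide is about what the type system has to get right: where inherited and declared fields live, which method a virtual call reaches, and whether a cast is valid. The following is a minimal sketch of those three behaviors as seen from C#. It repeats the slide's classes so that it compiles on its own; MyDerived and TypeSystemDemo are hypothetical names added for illustration and do not appear in the talk.

```csharp
using System;

class MyBase
{
    public int baseField;
}

class MyClass : MyBase
{
    public int myField;

    // Reads one inherited field and one declared field of the same object.
    public virtual int myFunc()
    {
        int result = myField + baseField;
        return result;
    }
}

// Hypothetical subclass (not on the slides) so that a virtual call has an override to find.
class MyDerived : MyClass
{
    public override int myFunc() => 2 * base.myFunc();
}

static class TypeSystemDemo
{
    // The static type is MyClass; the method that runs is chosen from the
    // runtime type of the object (virtual dispatch).
    public static int CallThroughBase(MyClass c) => c.myFunc();

    // A type check the runtime must answer correctly for casts to be safe.
    public static bool IsMyClass(MyBase b) => b is MyClass;

    static void Main()
    {
        MyClass plain = new MyClass { baseField = 1, myField = 2 };
        MyClass derived = new MyDerived { baseField = 1, myField = 2 };

        Console.WriteLine(CallThroughBase(plain));
        Console.WriteLine(CallThroughBase(derived));
        Console.WriteLine(IsMyClass(new MyBase()));
        Console.WriteLine(IsMyClass(derived));
    }
}
```

The sketch only shows the observable behavior. How fields are actually laid out in memory and how vtable slots are assigned is a runtime decision that C# code does not see.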

Question: How many methods are JITed to run this HelloWorld console application?

    using System;

    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
        }
    }

Answer: 243 methods on .NET Core 2.1

    AppDomain:SetupDomain(bool,ref,ref,ref,ref):this
    AppDomain:SetupFusionStore(ref,ref):this
    AppDomainSetup:SetupDefaults(ref,bool):this
    StreamWriter:set_AutoFlush(bool):this
    StreamWriter:Flush(bool,bool):this
    CultureInfo:get_InvariantCulture():ref
    CultureInfo:.cctor()
    CultureData:get_Invariant():ref
    Program:Main(ref)
    Console.WriteLine()
    Enumerator:MoveNext():bool:this
    Enumerator:get_Current():ref:this
    Dictionary`2:TryInsert(ref,ref,ubyte):bool:this
    Dictionary`2:Initialize(int):int:this
    StringBuilder:Append(ref):ref:this
    StringBuilder:Append(ushort):ref:this
    StringBuilder:Remove(int,int):ref:this
    RuntimeType:MakeGenericType(ref):ref:this
    RuntimeType:get_IsGenericTypeDefinition():bool:this
    RuntimeType:GetGenericArguments():ref:this
    String:LastIndexOfCharArray(ref,int,int):int:this
    String:InitializeProbabilisticMap(long,struct)
    String:ArrayContains(ushort,ref):bool
    String:Substring(int,int):ref:this
    CompareInfo:Compare(ref,ref,int):int:this
    CompareInfo:CompareString(struct,struct,int):int:this
    CompareInfo:GetNativeCompareFlags(int):int

Question: How many methods are JITed to run this HelloWorld Web API application?

    public class ValuesController
    {
        [HttpGet("/")]
        public string Hello() => "Hello World!";

        [HttpGet("/api/values")]
        public IEnumerable<string> Get()
        {
            return new string[] { "value1", "value2" };
        }

        [HttpGet("/api/values/{id}")]
        public string Get(int id)
        {
            return "Your value is " + id;
        }
    }

https://github.com/dotnet/corert/tree/master/samples/WebApi

Simple HelloWorld WebApi sample

    Configuration   Hot Startup   Cold Startup   Methods JITed   Kbytes JITed   JIT Time (Hot)
    NET471 JIT      1.38s         1.78s          4,417           392kb          1.49s

We have a problem here
Measure.. Measure.. Measure..
  • HelloWorld WebApi takes 1.38s to run
  • We are asked to JIT over 4000 methods

[Diagram: Type System, JIT, JITEEInterface]
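The "Measure.. Measure.. Measure.." slide reports three JIT numbers: methods JITed, IL bytes JITed, and JIT time. The talk does not say how they were collected. As a hedged sketch, on .NET 6 and later (which postdates the .NET Core 2.1 runtime used in the talk) the System.Runtime.JitInfo class exposes the same three counters, so a before/after delta around a piece of work can be taken as follows. JitMeasure and Measure are hypothetical names for this illustration.

```csharp
using System;
using System.Runtime;

static class JitMeasure
{
    // Runs 'work' and returns how much JIT activity the process saw meanwhile.
    // Assumption: requires .NET 6 or later, where System.Runtime.JitInfo exists.
    // The counters are process-wide by default, so JIT activity on other
    // threads during 'work' is included in the delta.
    public static (long Methods, long IlBytes, TimeSpan JitTime) Measure(Action work)
    {
        long methodsBefore = JitInfo.GetCompiledMethodCount();
        long ilBytesBefore = JitInfo.GetCompiledILBytes();
        TimeSpan timeBefore = JitInfo.GetCompilationTime();

        work();

        return (JitInfo.GetCompiledMethodCount() - methodsBefore,
                JitInfo.GetCompiledILBytes() - ilBytesBefore,
                JitInfo.GetCompilationTime() - timeBefore);
    }

    static void Main()
    {
        var delta = Measure(() => Console.WriteLine("Hello World!"));

        Console.WriteLine($"Methods JITed:  {delta.Methods}");
        Console.WriteLine($"IL bytes JITed: {delta.IlBytes}");
        Console.WriteLine($"JIT time:       {delta.JitTime.TotalMilliseconds} ms");

        // Totals since process start, comparable in spirit to the slide's columns.
        Console.WriteLine($"Total methods JITed so far: {JitInfo.GetCompiledMethodCount()}");
    }
}
```

The absolute numbers depend on the runtime version and on how much of the framework is precompiled, so they should not be expected to match the figures on the slides.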

Calling engineer in action

Precompile on targeted device
  • Execution and compilation environment is always matched
NGEN
  • Cache the JIT and TypeSystem result at deploy time
  • Remove the majority of JIT from program execution
  • Program execution is just running compiled code

Simple HelloWorld WebApi sample

    Configuration        Hot Startup   Cold Startup   Methods JITed   Kbytes JITed   JIT Time (Hot)
    NET471 JIT           1.38s         1.78s          4,417           392kb          1.49s
    NET471 Ngen for FX   0.48s         1.02s          1,153           81kb           0.42s

Startup is now 0.48 seconds. Not bad! We are done!

Fragility
JIT / TypeSystem output depends on
  • Layout of code in the application and framework
  • Data structures within the CLR
This is fragile and causes precompiled images to be invalidated
  • .NET Framework is serviced via Windows Update
  • Application dependencies update
Performance can change after deployment

Engineer was happy for a while!

The world changes on you..
  • Devices where battery life matters
  • Build once and deploy on millions of servers
  • Sorry, but we don't trust you.
      • Security: executables on disk should be signed
      • No admin service allowed on servers, locked-down devices, or Linux

Compile once at build lab
  • Need to deal with mismatch between compilation and execution
      • Scale back caching: don't lay out types until execution
      • Scale back code optimizations such as inlining and de-virtualization
CrossGen.exe
  • Generates less performant code
  • With version resilience, and copy-deployable

C#:

    public static void GenDoTest(GenBaseClass o, string exp)
    {
        Debugger.Break();
        string res = o.ToString();
        if (exp != res) throw new Exception();
    }

NGEN codegen:

    push rdi
    push rsi
    sub  rsp,28h
    mov  rsi,rdx
    mov  rdi,r8
    call CLRStub[ExternalMethodThunk]@7ffd363160c0 (00007ffd`363160c0)
    mov  rcx,rsi
    mov  rax,qword ptr [rsi]
    mov  rax,qword ptr [rax+40h]
    call qword ptr [rax+20h]          ; => goes directly to target
    mov  rdx,rax
    mov  rcx,rdi
    call CLRStub[ExternalMethodThunk]@7ffd363160c8 (00007ffd`363160c8)
    test eax,eax
    je   00007ffd`3631ca60
    add  rsp,28h
    pop  rsi
    pop  rdi
    ret

Crossgen codegen:

    push rdi
    push rsi
    push rbx
    sub  rsp,30h
    mov  qword ptr [rsp+28h],rcx
    mov  rsi,rcx
    mov  rdi,rdx
    mov  rbx,r8
    call qword ptr [MyRepro_ni+0x1178 (00007ffd`36421178)]
    mov  rcx,rsi
    call qword ptr [MyRepro_ni+0x1038 (00007ffd`36421038)]
    mov  rcx,rdi
    mov  r11,rax
    cmp  dword ptr [rcx],ecx
    call qword ptr [rax]              ; => goes through VSD stub (5 more instructions before hitting target)
    mov  rdx,rax
    mov  rcx,rbx
    call qword ptr [MyRepro_ni+0x1180 (00007ffd`36421180)]
    test al,al
    jne  00007ffd`36424216
    add  rsp,30h
    pop  rbx
    pop  rsi
    pop  rdi
    ret

Simple HelloWorld WebApi sample

    Runtime      Configuration          Hot Startup   Cold Startup   Methods JITed   IL Code JITed   JIT Time
    .NET 4.7.1   JIT                    1.38 s        1.78 s         4,417           392 kB          1.49 s
    .NET 4.7.1   NGen                   0.48 s        1.02 s         1,153           81 kB           0.42 s
    CoreCLR      JIT                    1.02 s        1.35 s         3,521           302 kB          0.99 s
    CoreCLR      NGEN for CoreLib       0.60 s        1.09 s         1,961           147 kB          0.54 s
    CoreCLR      NGEN+CrossGen for FX   0.47 s        0.94 s         1,235           94 kB           0.35 s
    CoreCLR      NGEN+CrossGen all      0.26 s        0.75 s         293             28 kB           0.08 s

https://github.com/dotnet/corert/tree/master/samples/WebApi

[Figure labels: Engineer, Performance team, ???]

How about throughput?
Numbers collected using .NET Core 2.1 on a machine with 8 cores (Xeon Core i7 2GHz) and 32GB of RAM

    Json Serialization Benchmark (Requests/sec), higher is better

    Configuration                           Requests/sec
    JIT                                     123,000
    Default: only Shared FX is CrossGen     120,000
    CrossGen Shared FX and application      115,000

Oops! We pushed the problem elsewhere.

Code generation technology choices
  • Ahead-of-time generation (CrossGen)
  • Interpreter
  • JIT
      • Minimum optimizations
      • Maximum optimizations

Tiered Compilation
  • Generate code multiple times for a single method
      • Method bodies have a versioning story
  • Generate with minimum optimizations at startup
  • Replace with more highly optimized code at steady state
  • Use CrossGen to avoid generation at all for most methods

[Flowchart labels: Method has CrossGen (no / yes); Minimum Optimization JIT; CrossGen; Runtime monitoring; Becoming hot?; Optimized JIT; Optimized code]

Heuristic of the tiering
  • Steady state vs. startup is a gray area
  • How to determine hot methods?
      • Hit count to trigger fully optimized JIT
      • Or use sample profiling to trigger fully optimized JIT
      • Other potential future heuristics
  • Presently using a hit count of 30
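The tiering slides describe a simple policy: start every method on cheap code, count calls, and switch to fully optimized code once the method looks hot (a hit count of 30 at the time of the talk). The following is a toy model of that policy in C#, not the CLR's implementation. TieredMethod, its constructor parameters, and the two delegates standing in for "minimum optimization" and "fully optimized" code are all hypothetical. The sketch is single-threaded and swaps implementations immediately, so it ignores whatever scheduling the real runtime applies when it re-JITs a method.

```csharp
using System;

// Toy model of hit-count tiering: two interchangeable bodies for one method.
sealed class TieredMethod<T, TResult>
{
    private readonly Func<T, TResult> _optimized;
    private readonly int _threshold;
    private Func<T, TResult> _current;
    private int _hits;

    public bool IsOptimized { get; private set; }

    public TieredMethod(Func<T, TResult> tier0, Func<T, TResult> optimized, int threshold = 30)
    {
        _current = tier0;        // startup: cheap-to-produce code
        _optimized = optimized;  // steady state: expensive-to-produce, faster code
        _threshold = threshold;
    }

    public TResult Invoke(T arg)
    {
        // Run whichever body is installed right now.
        TResult result = _current(arg);

        // Count the call; once the method is "hot", install the optimized body
        // so that the next call uses it.
        if (!IsOptimized && ++_hits >= _threshold)
        {
            _current = _optimized;
            IsOptimized = true;
        }
        return result;
    }
}

static class TieringDemo
{
    static void Main()
    {
        int tier0Calls = 0, optimizedCalls = 0;
        var square = new TieredMethod<int, int>(
            x => { tier0Calls++; return x * x; },
            x => { optimizedCalls++; return x * x; });

        for (int i = 0; i < 100; i++) square.Invoke(i);

        Console.WriteLine($"tier-0 calls: {tier0Calls}, optimized calls: {optimizedCalls}");
    }
}
```

The design point the model captures is that both bodies must compute the same result, which is the "method bodies have a versioning story" requirement from the Tiered Compilation slide.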

Measure again

    Json Serialization Benchmark (Requests/sec), higher is better

    Configuration                                    Requests/sec
    Default (shipping): only Shared FX is CrossGen   120,000
    All JIT                                          123,000
    All CrossGen                                     115,000
    CrossGen+Tiered JIT                              123,000

Recap on our codegen journey
[Diagram labels: Pure JIT, CrossGen, NGEN, Tier JITting; axis: "More capability in runtime for code optimization"]

2. Latency case study

What is acceptable latency?
  • A Bing query is less than 1 second end-to-end user experience
  • HoloLens is 60 frames per second
  • Multi-player real-time online gaming: Age of Ascent, Illyriad Games

Demo: Age of Ascent
When the right flavor of GC was not available

Case study: Bing's migration to .NET Core 2.1
Bing clocks a 34% improvement in internal server latency over the previous version after the .NET Core migration

Three-pronged tuning
  • Tune runtime determinism
  • Performance feature to enable building a leaner framework
  • Data-driven framework optimization

Tune runtime determinism
  • Enable server GC on Linux to eliminate the long pause
      • GetWriteWatch Windows API not available on Linux
      • Software Write Watch for Concurrent GC
  • Deploy CrossGen for first-response delay

Performance feature to enable building a leaner framework
  • Span and Memory
      • A feature that cuts across language, runtime, and framework
      • Applying the feature throughout the framework
      • Enable partners to rearchitect their programs

Data-driven targeted framework optimization
  • Vectorization of string.Equals
  • Methods with calli are now inline-able
  • Devirtualization for EqualityComparer<T>.Default
  • Improve performance of String.IndexOfAny for 2 and 3 char searches
        public int IndexOfAny (char[] anyOf);

3. Takeaways
  • No silver bullet for performance
      • Be data driven
      • Design for performance
      • Tune for performance
      • It may require many small pieces of work
  • Performance
      • Is hard
      • Is ongoing
      • Is a priority
  • Understand your requirements ahead of time
  • Monitor and revalidate

Questions?
Mei-Chin Tsai: Github.com/MeiChin-Tsai
Jared Parsons: twitter.com/jaredpar, Github.com/jaredpar

AMA
.NET/.NET Core AMA, 11/7, 2:55pm, Ballroom C

Links
  • https://blogs.msdn.microsoft.com/dotnet/2018/08/20/bing-com-runs-on-net-core-2-1
  • https://blogs.msdn.microsoft.com/dotnet/2018/04/18/performance-improvements-in-net-core-2-1/
  • https://blogs.msdn.microsoft.com/dotnet/2018/08/02/tiered-compilation-preview-in-net-core-2-1/
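Supplementary sketch for the "Span and Memory" slide in the latency case study: the slide names the feature but shows no code. As a minimal illustration of why it helps build leaner code, the example below sums comma-separated integers by slicing a ReadOnlySpan<char> instead of calling Substring or Split, so no intermediate strings are allocated. SpanDemo and SumCsv are hypothetical names; the example is not taken from the talk or from Bing's code, and it assumes well-formed input with no empty fields.

```csharp
using System;

static class SpanDemo
{
    // Sums comma-separated integers without allocating a string per field.
    // Assumes well-formed input: an empty field (for example "1,,2") makes
    // int.Parse throw a FormatException.
    public static int SumCsv(ReadOnlySpan<char> text)
    {
        int sum = 0;
        while (!text.IsEmpty)
        {
            int comma = text.IndexOf(',');

            // Slice is a view over the same memory, not a copy.
            ReadOnlySpan<char> field = comma < 0 ? text : text.Slice(0, comma);
            sum += int.Parse(field);

            text = comma < 0 ? ReadOnlySpan<char>.Empty : text.Slice(comma + 1);
        }
        return sum;
    }

    static void Main()
    {
        // A string converts implicitly to ReadOnlySpan<char>.
        Console.WriteLine(SumCsv("10,20,30"));
    }
}
```

The span-based int.Parse overload used here is available from .NET Core 2.1 onward, the same release the case study targets.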
