Performance Improvements in .NET 8 — JIT (partial translation)
Related video: Dynamic PGO
Benchmark settings
In this article, I include microbenchmarks to highlight various aspects of the discussion. Most of these benchmarks were implemented using BenchmarkDotNet v0.13.8, and unless otherwise noted, each benchmark has a simple setup.
To follow this article, first make sure you have .NET 7 and .NET 8 installed. For this article, I used the .NET 8 Release Candidate (8.0.0-rc.1.23419.4).
After completing these prerequisites, create a new C# console project in a new benchmarks directory:
dotnet new console -o benchmarks
cd benchmarks
This directory will contain two files: benchmarks.csproj (the project file that contains information about how the application should be built) and Program.cs (the application’s code). Replace the entire contents of benchmarks.csproj with the following:
<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>net8.0;net7.0</TargetFrameworks>
    <LangVersion>Preview</LangVersion>
    <ImplicitUsings>enable</ImplicitUsings>
    <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
    <ServerGarbageCollection>true</ServerGarbageCollection>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="BenchmarkDotNet" Version="0.13.8" />
  </ItemGroup>

</Project>
The above project file tells the build system that we want to:
- Build a runnable application (not a library);
- Be able to run on both .NET 8 and .NET 7 (so that BenchmarkDotNet can launch multiple processes, one for .NET 7 and one for .NET 8, and compare the results);
- Use all the latest C# language features, even though C# 12 has not yet been officially released;
- Automatically import common namespaces;
- Be able to use the unsafe keyword in code;
- And configure the garbage collector (GC) to use its “server” configuration, which affects the trade-off it makes between memory consumption and throughput (this isn't strictly necessary; I'm just used to using it, and it's the default for ASP.NET applications).
The final ItemGroup pulls BenchmarkDotNet in from NuGet so that we can use the library in Program.cs. (A few benchmarks require additional packages; I've noted these where applicable.)
For each benchmark, I've included the complete Program.cs source code; just copy and paste that code into Program.cs, replacing its entire contents. In each test, you'll notice several attributes applied to the Tests class. The [MemoryDiagnoser] attribute indicates that I want it to track managed allocations, the [DisassemblyDiagnoser] attribute indicates that I want it to report the actual assembly code generated for the test (and, by default, for functions called one level deep from the test), and the [HideColumns] attribute simply suppresses some data columns that BenchmarkDotNet would otherwise emit by default but that are unnecessary here.
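As a minimal sketch of how those attributes fit together (my own illustration, not one of the article's benchmarks; the Sum method and its contents are purely hypothetical), a Program.cs using all three might look like this:

// Illustrative only: a hypothetical benchmark showing the three attributes
// described above applied together.
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[MemoryDiagnoser]                                      // track managed allocations
[DisassemblyDiagnoser]                                 // report the generated assembly
[HideColumns("Error", "StdDev", "Median", "RatioSD")]  // hide columns not needed here
public class Tests
{
    private readonly int[] _values = Enumerable.Range(0, 1_000).ToArray();

    [Benchmark]
    public int Sum()
    {
        int sum = 0;
        foreach (int value in _values) sum += value;
        return sum;
    }
}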
After that, running the benchmarks is simple. Each test shown also includes a comment with the dotnet command used to run it. It usually looks like this:
dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0
The above dotnet run command:
- Builds the benchmarks in a Release configuration. This is important for performance testing, because most optimizations in both the C# compiler and the JIT compiler are disabled in Debug builds.
- Hosts the project on .NET 7. Typically, when using BenchmarkDotNet, you want to target the lowest common denominator of all the runtimes you'll execute against, to ensure that every API used is available everywhere it's needed.
- Runs all of the benchmarks in the program. The --filter parameter can narrow the run to just a desired subset of benchmarks, and "*" means "run all of them".
- Runs the tests on both .NET 7 and .NET 8.
Throughout this article, I show a number of benchmarks and the results I got when running them. All of the code works correctly on all supported operating systems and architectures. Unless otherwise noted, the benchmarks were run on Linux (Ubuntu 22.04) on an x64 processor. My standard caveat: these are microbenchmarks that typically measure very short operations, but when operations that are performed over and over again get even slightly faster, the impact adds up. Different hardware, different operating systems, other running programs, your current mood, and what you had for breakfast can all affect the numbers involved. In short, don't expect the numbers you see to exactly match the numbers I report here, although I chose examples where the magnitude of the difference is fully reproducible.
Now, let’s get started…
JIT
Code generation affects every line of code we write, and the quality of the code the compiler produces is critical to an application's end-to-end performance. In .NET, that is the job of the Just-In-Time (JIT) compiler. As one example, the JIT previously always called the CastHelpers.ChkCastAny method to perform casts like the one below, but in .NET 8 it inlines a faster path for the successful case.
// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
private readonly object _o = "hello";
[Benchmark]
public string GetString() => Cast<string>(_o);
[MethodImpl(MethodImplOptions.NoInlining)]
public T Cast<T>(object o) => (T)o;
}
| Method    | Runtime  | Mean     | Ratio |
|-----------|----------|----------|-------|
| GetString | .NET 7.0 | 2.247 ns | 1.00  |
| GetString | .NET 8.0 | 1.300 ns | 0.58  |
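To make "inlines a faster path" slightly more concrete, here is a purely conceptual C# sketch of the idea; the real work happens in the JIT's generated assembly against the object's method table, and the CastSketch, CastToString, and ThrowInvalidCast names below are my own, not runtime APIs:

using System;
using System.Runtime.CompilerServices;

// Conceptual illustration only: the JIT emits the equivalent of this check
// inline at the cast site rather than always calling the cast helper.
static class CastSketch
{
    public static string? CastToString(object? o)
    {
        // Inlined fast path: null or an exact type match succeeds immediately.
        if (o is null || o.GetType() == typeof(string))
            return Unsafe.As<string>(o);

        // Slow path: stand-in for the runtime's cast helper
        // (CastHelpers.ChkCastAny), which throws when the cast is invalid.
        return ThrowInvalidCast(o);
    }

    private static string ThrowInvalidCast(object o) =>
        throw new InvalidCastException($"Unable to cast {o.GetType()} to string.");
}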
Peephole Optimizations
A “peephole optimization” is an optimization that replaces a small sequence of instructions with a different sequence that is expected to perform better, whether by removing an instruction deemed unnecessary or by replacing two instructions with one that accomplishes the same thing. Every .NET release includes a large number of new peephole optimizations, often inspired by real-world examples where slightly better code quality shaves off some overhead, and .NET 8 is no exception. Here are some of these optimizations in .NET 8 (a small illustrative benchmark follows the list):
- dotnet/runtime#73120 and dotnet/runtime#74806 improve handling of common bit test patterns such as (x & 1) != 0.
- dotnet/runtime#77874 eliminates some unnecessary conversions like short Add(short x, short y) => (short)(x + y).
- dotnet/runtime#76981 improves the performance of multiplying by a constant that is 1 away from a power of 2, replacing the imul instruction with a three-instruction mov/shl/add sequence, while dotnet/runtime#77137 handles the same pattern with a lea instruction and also improves other constant multiplications involving mov/shl sequences.
- dotnet/runtime#78786 merges individual conditions such as value < 0 || value == 0 into the equivalent value <= 0.
- dotnet/runtime#82750 eliminates some unnecessary cmp instructions.
- dotnet/runtime#79630 avoids unnecessary and instructions in cases like static byte Mod(uint i) => (byte)(i % 256).
- dotnet/runtime#77540, dotnet/runtime#84399 and dotnet/runtime#85032 optimize a pair of load and store instructions and replace them with a single ldp or stp instruction on Arm.
- dotnet/runtime#84350 similarly optimizes a pair of str wzr instructions into a str xzr instruction.
- dotnet/runtime#83458 optimizes some redundant memory loading on Arm by replacing some ldr instructions with mov instructions.
- dotnet/runtime#83176 optimizes x < 0 expressions on Arm to emit an lsr instruction instead of a cmp/cset sequence.
- dotnet/runtime#82924 eliminates redundant overflow checking for certain division operations on Arm.
- dotnet/runtime#84605 combines lsl/cmp sequences on Arm into a single cmp.
- dotnet/runtime#84667 combines neg and cmp sequences to use cmn on Arm.
- dotnet/runtime#79550 replaces mul/neg sequences on Arm with mneg.
(Only some of the Arm-specific improvements are covered here; see Arm64 Performance Improvements in .NET 8 for more details.)
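As a hedged illustration of what these peephole patterns look like from C# (my own sketch, not a benchmark from the article; the PeepholeTests class and its method names are hypothetical), the following microbenchmark exercises two of the patterns mentioned above: the (x & 1) != 0 bit test and the value < 0 || value == 0 comparison merge. The measured time difference on such tiny operations may be small, but [DisassemblyDiagnoser] makes the instruction-level change visible when comparing the .NET 7 and .NET 8 output:

// dotnet run -c Release -f net7.0 --filter "*" --runtimes net7.0 net8.0
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(PeepholeTests).Assembly).Run(args);

[DisassemblyDiagnoser] // dump the generated assembly so the instruction sequences can be compared
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public class PeepholeTests
{
    private int _value = 12345;

    // Bit-test pattern mentioned above (dotnet/runtime#73120 and #74806): (x & 1) != 0
    [Benchmark]
    public bool IsOdd() => (_value & 1) != 0;

    // Comparison merge mentioned above (dotnet/runtime#78786): value < 0 || value == 0
    [Benchmark]
    public bool IsNonPositive() => _value < 0 || _value == 0;
}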
Author: Yahle
Original: http://www.cnblogs.com/yahle
All rights reserved. When reprinting, the author and the original source must be cited in the form of a link.