I am trying to understand the new hardware intrinsics in .NET Core 3 (https://devblogs.microsoft.com/dotnet/hardware-intrinsics-in-net-core/).
I want to implement the simple C++ example from here:
#include "emmintrin.h"
const __m128i v2 = _mm_set1_epi64x(2);
__m128i v = _mm_set_epi64x(1, 0);
for (size_t i=0; i<1000*1000*1000; i += 2)
{
_mm_stream_si128((__m128i *)&data[i], v);
v = _mm_add_epi64(v, v2);
}
(I know the above can be done in C# using the SIMD Vector type.)
Looking at https://source.dot.net/#System.Private.CoreLib/shared/System/Runtime/Intrinsics/X86/Sse2.cs,1392, I think I need to use this function:
/// <summary>
/// void _mm_stream_si128 (__m128i* mem_addr, __m128i a)
/// MOVNTDQ m128, xmm
/// </summary>
public static unsafe void StoreAlignedNonTemporal(long* address, Vector128<long> source) => StoreAlignedNonTemporal(address, source);
My C# program is as follows.
Intrinsics.csproj:
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<TargetFramework>netcoreapp3.0</TargetFramework>
<OutputType>Exe</OutputType>
</PropertyGroup>
</Project>
Program.cs:
using System;
using System.Runtime.Intrinsics.X86;
using System.Runtime.Intrinsics;
public class Program
{
public static void Main(string[] args)
{
if(!Sse2.IsSupported){
Console.WriteLine("Your CPU doesn't support SSE2 Instruction set");
return;
}
var data = new long[100000];
var v = Vector128.Create(1L, 0L);
var v2 = Vector128.Create(0L, 0L);
Span<long> buffer = data.AsSpan();
for (int i=0; i<100000; i+=2)
{
Sse2.StoreAlignedNonTemporal(buffer[i], v);
// TODO: convert this to C#: v = _mm_add_epi64(v, v2);
}
}
}
When I try to build the project, it fails with the following errors:
burnsba@debian:~/code/Intrinsics$ dotnet build
Microsoft (R) Build Engine version 16.3.0+0f4c62fea for .NET Core
Copyright (C) Microsoft Corporation. All rights reserved.
Restore completed in 24.2 ms for /home/burnsba/code/Intrinsics/Intrinsics.csproj.
Program.cs(22,42): error CS1503: Argument 1: cannot convert from 'long' to 'byte*' [/home/burnsba/code/Intrinsics/Intrinsics.csproj]
Program.cs(22,53): error CS1503: Argument 2: cannot convert from 'System.Runtime.Intrinsics.Vector128<long>' to 'System.Runtime.Intrinsics.Vector128<byte>' [/home/burnsba/code/Intrinsics/Intrinsics.csproj]
Build FAILED.
Program.cs(22,42): error CS1503: Argument 1: cannot convert from 'long' to 'byte*' [/home/burnsba/code/Intrinsics/Intrinsics.csproj]
Program.cs(22,53): error CS1503: Argument 2: cannot convert from 'System.Runtime.Intrinsics.Vector128<long>' to 'System.Runtime.Intrinsics.Vector128<byte>' [/home/burnsba/code/Intrinsics/Intrinsics.csproj]
0 Warning(s)
2 Error(s)
Time Elapsed 00:00:01.19
burnsba@debian:~/code/Intrinsics$ dotnet --version
3.0.100
How should I use Sse2.StoreAlignedNonTemporal?
Answer 0 (score: 1)
I got the following program to compile and run. In that sense, my question is answered.
Intrinsics.csproj
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<TargetFramework>netcoreapp3.0</TargetFramework>
<AllowUnsafeBlocks>true</AllowUnsafeBlocks>
<OutputType>Exe</OutputType>
<DebugSymbols>true</DebugSymbols>
</PropertyGroup>
<ItemGroup>
<PackageReference Include="BenchmarkDotNet" Version="0.11.5" />
</ItemGroup>
</Project>
Program.cs
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics.X86;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
namespace IntrinsicsDemo
{
public class Program
{
public static void Main(string[] args)
{
if (!Sse2.IsSupported)
{
Console.WriteLine("Your CPU doesn't support SSE2 Instruction set");
return;
}
var summary = BenchmarkRunner.Run<IntrinsicsBench>();
}
}
[SimpleJob]
[MemoryDiagnoser]
public unsafe class IntrinsicsBench
{
private long[] _data = new long[100000];
private Vector128<long> _v = Vector128.Create(1L, 0L);
private Vector128<long> _v2 = Vector128.Create(0L, 0L);
public IntrinsicsBench()
{
for (var i = 0; i < _data.Length; i++)
{
_data[i] = 0;
}
}
[Benchmark(Baseline = true)]
public long[] Default()
{
for (var i = 0; i < _data.Length; i++)
{
_data[i] = i;
}
return _data;
}
[Benchmark]
public long[] DefaultSpan()
{
var buffer = _data.AsSpan();
for (var i = 0; i < buffer.Length; i++)
{
buffer[i] = i;
}
return _data;
}
[Benchmark]
public long[] Unroll8()
{
var buffer = _data.AsSpan();
for (var i = 0; i < buffer.Length; i += 8)
{
buffer[i + 0] = i + 0;
buffer[i + 1] = i + 1;
buffer[i + 2] = i + 2;
buffer[i + 3] = i + 3;
buffer[i + 4] = i + 4;
buffer[i + 5] = i + 5;
buffer[i + 6] = i + 6;
buffer[i + 7] = i + 7;
}
return _data;
}
[Benchmark]
public long[] Sse2Test()
{
unsafe
{
fixed (long* lp = _data)
{
for (int i = 0; i < _data.Length; i += 2)
{
Sse2.StoreAlignedNonTemporal(lp + i, _v);
_v = Sse2.Add(_v, _v2);
}
}
}
return _data;
}
}
}
However, the method using the SSE2 intrinsics is more than twice as slow as the default naive implementation:
BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.805 (1809/October2018Update/Redstone5)
Intel Core i7-8850H CPU 2.60GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100
[Host] : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT
DefaultJob : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT
| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------ |----------:|----------:|----------:|------:|--------:|------:|------:|------:|----------:|
| Default | 43.53 us | 0.8155 us | 0.8009 us | 1.00 | 0.00 | - | - | - | - |
| DefaultSpan | 43.51 us | 0.4265 us | 0.3562 us | 1.00 | 0.02 | - | - | - | - |
| Unroll8 | 32.81 us | 0.6404 us | 0.8327 us | 0.76 | 0.03 | - | - | - | - |
| Sse2Test | 104.92 us | 2.0906 us | 2.5674 us | 2.41 | 0.08 | - | - | - | - |
I am not sure what is wrong.
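A few things in the benchmark above may contribute to the result, though none of this has been measured here. _v2 is all zeros, whereas the C++ original adds 2 on every step. _v and _v2 are instance fields rather than locals, and _v is never reset between invocations. A long[] pinned with fixed is not guaranteed to be 16-byte aligned, which MOVNTDQ requires. Finally, non-temporal stores bypass the cache, which tends not to help when the same 800 KB buffer is rewritten over and over. Note also that _mm_set_epi64x(1, 0) takes the high element first, while Vector128.Create(e0, e1) takes the low element first. The sketch below addresses the correctness points only; StreamFill and Fill are hypothetical names, not part of any library.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper class, for illustration only.
public static unsafe class StreamFill
{
    // Writes data[i] = i, using 16-byte non-temporal stores where possible.
    public static void Fill(long[] data)
    {
        if (!Sse2.IsSupported)
        {
            // Plain scalar fallback.
            for (int k = 0; k < data.Length; k++) data[k] = k;
            return;
        }

        fixed (long* p = data)
        {
            int i = 0;

            // Scalar prologue: advance until p + i is 16-byte aligned.
            while (i < data.Length && ((ulong)(p + i) & 15) != 0)
            {
                p[i] = i;
                i++;
            }

            // Locals instead of fields. Element 0 is the low lane,
            // so v holds (i, i + 1) in memory order.
            Vector128<long> v = Vector128.Create((long)i, (long)i + 1);
            Vector128<long> step = Vector128.Create(2L);

            for (; i + 2 <= data.Length; i += 2)
            {
                Sse2.StoreAlignedNonTemporal(p + i, v);
                v = Sse2.Add(v, step);
            }

            // Make the streamed stores ordered before any later stores.
            Sse.StoreFence();

            // Scalar epilogue for a leftover element.
            for (; i < data.Length; i++) p[i] = i;
        }
    }
}
```

Calling StreamFill.Fill(new long[100000]) should leave the array holding 0, 1, 2, and so on. Whether it beats the plain loop depends on buffer size and hardware, so it would need to be benchmarked separately.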
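Answer 1 (score: 0)
You need to pass a long* as the first argument to StoreAlignedNonTemporal, but you are passing a long. The compiler cannot find a matching method signature.
You can see how it is used here: coreclr/tests/src/JIT/HardwareIntrinsics/X86/Sse2/StoreAlignedNonTemporal.cs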
long* inArray = stackalloc long[2];
byte* outBuffer = stackalloc byte[32];
long* outArray = (long*)Align(outBuffer, 16);
var vf = Unsafe.Read<Vector128<long>>(inArray);
Sse2.StoreAlignedNonTemporal(outArray, vf);
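Note that the pointer-based intrinsics, such as this one, can only be called from unsafe code, so the project also needs AllowUnsafeBlocks set to true.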
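The snippet above relies on an Align helper that is defined elsewhere in the test file. As a minimal self-contained sketch, assuming a helper that simply rounds a pointer up to the next multiple of a power-of-two alignment, a single call could look like this. AlignedStoreDemo, Align and StoreOnce are hypothetical names chosen for illustration, not the coreclr implementation.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical demo class, for illustration only.
public static unsafe class AlignedStoreDemo
{
    // Assumed helper: rounds buffer up to the next multiple of alignment,
    // which must be a power of two.
    private static void* Align(byte* buffer, int alignment)
    {
        ulong mask = (ulong)alignment - 1;
        return (void*)(((ulong)buffer + mask) & ~mask);
    }

    // Stores the vector (0, 1) to 16-byte-aligned stack memory and reads it back.
    public static (long, long) StoreOnce()
    {
        if (!Sse2.IsSupported)
        {
            throw new PlatformNotSupportedException("SSE2 is required.");
        }

        // 32 bytes leaves room for up to 15 bytes of padding plus one 16-byte store.
        byte* raw = stackalloc byte[32];
        long* dst = (long*)Align(raw, 16);

        // Element 0 is the low lane and lands at dst[0].
        Vector128<long> v = Vector128.Create(0L, 1L);
        Sse2.StoreAlignedNonTemporal(dst, v);

        return (dst[0], dst[1]);
    }

    public static void Main()
    {
        var (low, high) = StoreOnce();
        Console.WriteLine(low + ", " + high);
    }
}
```

The first argument is a long* pointing at aligned memory, which is the shape the compiler was asking for in the question's error message.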