How to use StoreAlignedNonTemporal from the .NET Core 3 hardware intrinsics

Posted: 2019-10-30 15:30:47

Tags: c# .net-core sse

I am trying to understand the new hardware intrinsics in .NET Core 3 (https://devblogs.microsoft.com/dotnet/hardware-intrinsics-in-net-core/).

I want to implement the simple C++ example from here:

#include "emmintrin.h"

const __m128i v2 = _mm_set1_epi64x(2);
__m128i v = _mm_set_epi64x(1, 0);

for (size_t i=0; i<1000*1000*1000; i += 2)
{
    _mm_stream_si128((__m128i *)&data[i], v);
    v = _mm_add_epi64(v, v2);
}  

(I know the above can be done in C# using the SIMD Vector types.)

Looking at https://source.dot.net/#System.Private.CoreLib/shared/System/Runtime/Intrinsics/X86/Sse2.cs,1392, I think I need to use this function:

/// <summary>
/// void _mm_stream_si128 (__m128i* mem_addr, __m128i a)
///   MOVNTDQ m128, xmm
/// </summary>
public static unsafe void StoreAlignedNonTemporal(long* address, Vector128<long> source) => StoreAlignedNonTemporal(address, source);

My C# program is as follows.

Intrinsics.csproj:

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <TargetFramework>netcoreapp3.0</TargetFramework>
    <OutputType>Exe</OutputType>
  </PropertyGroup>

</Project>

Program.cs:

using System;
using System.Runtime.Intrinsics.X86;
using System.Runtime.Intrinsics;

public class Program
{
    public static void Main(string[] args)
    {
        if(!Sse2.IsSupported){
                Console.WriteLine("Your CPU doesn't support SSE2 Instruction set");
                return;
            }

        var data = new long[100000];
        var v = Vector128.Create(1L, 0L);
        var v2 = Vector128.Create(0L, 0L);

        Span<long> buffer = data.AsSpan();

        for (int i=0; i<100000; i+=2)
        {
            Sse2.StoreAlignedNonTemporal(buffer[i], v);
            // TODO: convert this to C#: v = _mm_add_epi64(v, v2);
        }
    }
}

When I try to build the project, it fails with the following errors:

burnsba@debian:~/code/Intrinsics$ dotnet build
Microsoft (R) Build Engine version 16.3.0+0f4c62fea for .NET Core
Copyright (C) Microsoft Corporation. All rights reserved.

  Restore completed in 24.2 ms for /home/burnsba/code/Intrinsics/Intrinsics.csproj.
Program.cs(22,42): error CS1503: Argument 1: cannot convert from 'long' to 'byte*' [/home/burnsba/code/Intrinsics/Intrinsics.csproj]
Program.cs(22,53): error CS1503: Argument 2: cannot convert from 'System.Runtime.Intrinsics.Vector128<long>' to 'System.Runtime.Intrinsics.Vector128<byte>' [/home/burnsba/code/Intrinsics/Intrinsics.csproj]

Build FAILED.

Program.cs(22,42): error CS1503: Argument 1: cannot convert from 'long' to 'byte*' [/home/burnsba/code/Intrinsics/Intrinsics.csproj]
Program.cs(22,53): error CS1503: Argument 2: cannot convert from 'System.Runtime.Intrinsics.Vector128<long>' to 'System.Runtime.Intrinsics.Vector128<byte>' [/home/burnsba/code/Intrinsics/Intrinsics.csproj]
    0 Warning(s)
    2 Error(s)

Time Elapsed 00:00:01.19

burnsba@debian:~/code/Intrinsics$ dotnet --version
3.0.100

How should I use Sse2.StoreAlignedNonTemporal?

2 answers:

Answer 0 (score: 1)

I got the following program to compile and run. In that sense, my question is answered.

Intrinsics.csproj

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <TargetFramework>netcoreapp3.0</TargetFramework>
    <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
    <OutputType>Exe</OutputType>
    <DebugSymbols>true</DebugSymbols>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="BenchmarkDotNet" Version="0.11.5" />
  </ItemGroup>

</Project>

Program.cs

using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics.X86;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

namespace IntrinsicsDemo
{
    public class Program
    {
        public static void Main(string[] args)
        {
            if (!Sse2.IsSupported)
            {
                Console.WriteLine("Your CPU doesn't support SSE2 Instruction set");
                return;
            }

            var summary = BenchmarkRunner.Run<IntrinsicsBench>();
        }
    }

    [SimpleJob]
    [MemoryDiagnoser]
    public unsafe class IntrinsicsBench
    {
        private long[] _data = new long[100000];
        private Vector128<long> _v = Vector128.Create(1L, 0L);
        private Vector128<long> _v2 = Vector128.Create(0L, 0L);

        public IntrinsicsBench()
        {
            for (var i = 0; i < _data.Length; i++)
            {
                _data[i] = 0;
            }
        }

        [Benchmark(Baseline = true)]
        public long[] Default()
        {
            for (var i = 0; i < _data.Length; i++)
            {
                _data[i] = i;
            }

            return _data;
        }

        [Benchmark]
        public long[] DefaultSpan()
        {
            var buffer = _data.AsSpan();
            for (var i = 0; i < buffer.Length; i++)
            {
                buffer[i] = i;
            }

            return _data;
        }

        [Benchmark]
        public long[] Unroll8()
        {
            var buffer = _data.AsSpan();
            for (var i = 0; i < buffer.Length; i += 8)
            {
                buffer[i + 0] = i + 0;
                buffer[i + 1] = i + 1;
                buffer[i + 2] = i + 2;
                buffer[i + 3] = i + 3;
                buffer[i + 4] = i + 4;
                buffer[i + 5] = i + 5;
                buffer[i + 6] = i + 6;
                buffer[i + 7] = i + 7;
            }

            return _data;
        }

        [Benchmark]
        public long[] Sse2Test()
        {
            unsafe
            {
                fixed (long* lp = _data)
                {
                    for (int i = 0; i < _data.Length; i += 2)
                    {
                        Sse2.StoreAlignedNonTemporal(lp + i, _v);
                        _v = Sse2.Add(_v, _v2);
                    }
                }
            }

            return _data;
        }
    }
}

However, the method that uses the SSE2 intrinsics takes more than twice as long as the default naive implementation:

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17763.805 (1809/October2018Update/Redstone5)
Intel Core i7-8850H CPU 2.60GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100
  [Host]     : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT
  DefaultJob : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT


|      Method |      Mean |     Error |    StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------ |----------:|----------:|----------:|------:|--------:|------:|------:|------:|----------:|
|     Default |  43.53 us | 0.8155 us | 0.8009 us |  1.00 |    0.00 |     - |     - |     - |         - |
| DefaultSpan |  43.51 us | 0.4265 us | 0.3562 us |  1.00 |    0.02 |     - |     - |     - |         - |
|     Unroll8 |  32.81 us | 0.6404 us | 0.8327 us |  0.76 |    0.03 |     - |     - |     - |         - |
|    Sse2Test | 104.92 us | 2.0906 us | 2.5674 us |  2.41 |    0.08 |     - |     - |     - |         - |

I am not sure what is wrong.
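For comparison, here is a minimal sketch of a loop that stays closer to the C++ original. It is not a measured fix and makes no claim about timings. It differs from Sse2Test above in three ways: the vectors live in locals rather than fields, the step is (2, 2) as in _mm_set1_epi64x(2) rather than (0, 0), and the start value has the lower element first (Vector128.Create takes elements low to high, while _mm_set_epi64x takes them high to low). It also only issues the non-temporal store on 16-byte-aligned addresses, which MOVNTDQ requires and which a pinned long[] is not guaranteed to provide. The names StreamFill and Fill are made up for this example.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper, not part of the original answer.
public static class StreamFill
{
    // Writes data[i] = i, using non-temporal 128-bit stores where the address allows it.
    public static unsafe void Fill(long[] data)
    {
        if (!Sse2.IsSupported)
        {
            // Plain scalar fallback for CPUs without SSE2.
            for (int k = 0; k < data.Length; k++) data[k] = k;
            return;
        }

        fixed (long* p = data)
        {
            int i = 0;

            // Scalar stores until p + i is 16-byte aligned (assumption: the
            // array itself may only be 8-byte aligned).
            while (i < data.Length && ((ulong)(p + i) & 15) != 0)
            {
                p[i] = i;
                i++;
            }

            // Lower element first: the vector holds (i, i + 1) in memory order.
            Vector128<long> v = Vector128.Create((long)i, (long)i + 1);
            Vector128<long> step = Vector128.Create(2L, 2L);

            for (; i + 2 <= data.Length; i += 2)
            {
                Sse2.StoreAlignedNonTemporal(p + i, v); // MOVNTDQ
                v = Sse2.Add(v, step);                  // PADDQ
            }

            // Scalar tail for a leftover element.
            for (; i < data.Length; i++) p[i] = i;
        }
    }
}
```

Calling StreamFill.Fill(new long[100000]) should leave the array holding 0, 1, 2, and so on, which is the same result the Default benchmark produces.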

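Answer 1 (score: 0)

You need to pass a long* as the first argument to StoreAlignedNonTemporal, but you are passing a long. The compiler cannot find a matching method signature.

You can see how it is used here: coreclr/tests/src/JIT/HardwareIntrinsics/X86/Sse2/StoreAlignedNonTemporal.cs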

  
long* inArray = stackalloc long[2]; 
byte* outBuffer = stackalloc byte[32]; 
long* outArray = (long*)Align(outBuffer, 16);

var vf = Unsafe.Read<Vector128<long>>(inArray);
Sse2.StoreAlignedNonTemporal(outArray, vf); 

Note that the pointer-based intrinsics, such as this one, can only be called from unsafe code.
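The snippet above calls an Align helper that is not shown. As a rough sketch of what such a helper might do, the code below rounds a pointer up to the next 16-byte boundary and then performs a single store. AlignedStoreDemo, AlignUp and StoreOnce are names invented for this example and are not taken from the coreclr test.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical names; only the Sse2 and Vector128 calls are real API.
public static class AlignedStoreDemo
{
    // Rounds a pointer up to the next multiple of `alignment`.
    // Assumes `alignment` is a power of two.
    public static unsafe byte* AlignUp(byte* buffer, int alignment)
    {
        ulong mask = (ulong)alignment - 1;
        return (byte*)(((ulong)buffer + mask) & ~mask);
    }

    // Stores (low, high) through a 16-byte-aligned pointer and reads it back.
    public static unsafe long[] StoreOnce(long low, long high)
    {
        // 32 bytes: 16 bytes of payload plus up to 15 bytes of alignment slack.
        byte* raw = stackalloc byte[32];
        long* aligned = (long*)AlignUp(raw, 16);

        if (Sse2.IsSupported)
        {
            Vector128<long> v = Vector128.Create(low, high);
            Sse2.StoreAlignedNonTemporal(aligned, v); // needs a long*, not a long
        }
        else
        {
            // Scalar fallback so the sketch still runs without SSE2.
            aligned[0] = low;
            aligned[1] = high;
        }

        return new[] { aligned[0], aligned[1] };
    }
}
```

The same idea applies to the question's loop: pinning the array with fixed yields a long*, and lp + i is then the address to pass, provided that address is 16-byte aligned.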