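Summary:
My class SquareDistance computes the square of the Cartesian distance in four ways, using methods with the following names:
1. Signed
2. UnsignedBranching
3. UnsignedDistribute
4. CastToSignedLong
The first is the fastest and uses signed integers, but my data must be unsigned (for reasons given below). The other three methods start with unsigned numbers. My goal is to write a method like those in SquareDistance that takes unsigned data and performs better than the three I have already written, as close as possible to the performance of #1. The code and benchmark results follow. (Unsafe code is permitted if you think it will help.)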
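Details:
I am developing an algorithm that solves the K-nearest neighbor problem using an index derived from the Hilbert curve. The execution time of the naive linear scan algorithm grows quadratically with the number of points and linearly with the number of dimensions, and it spends all of its time computing and comparing Cartesian distances.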
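For reference, the naive linear scan mentioned above can be sketched roughly as follows. This is only an illustration of why the cost is quadratic in the number of points and linear in the number of dimensions. The class `NaiveKnn` and its methods are hypothetical names, not part of the benchmark code in this question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NaiveKnn
{
    // Squared Euclidean distance, widened to long. This is ample for
    // coordinates like the benchmark's (below 100,000) but not for the full uint range.
    public static long SquareDistance(uint[] x, uint[] y)
    {
        var sum = 0L;
        for (var i = 0; i < x.Length; i++)
        {
            var delta = (long)x[i] - (long)y[i];
            sum += delta * delta;
        }
        return sum;
    }

    // Indices of the k points nearest to points[query], excluding the query itself.
    // One query costs O(N * D); repeating it for every point costs O(N^2 * D).
    public static int[] Nearest(IList<uint[]> points, int query, int k)
    {
        return Enumerable.Range(0, points.Count)
            .Where(j => j != query)
            .OrderBy(j => SquareDistance(points[query], points[j]))
            .Take(k)
            .ToArray();
    }
}
```

Every query touches every other point, which is the work the Hilbert index is meant to cut down.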
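The motivation behind the special Hilbert index is to reduce the number of times the distance function is called. However, it must still be called millions of times, so I must make it as fast as possible. (It is the most frequently called function in the program. A recent failed attempt to optimize the distance function doubled the program's execution time from seven minutes to fifteen, so no, this is not a premature or superfluous optimization.)
Dimensions: The points may have anywhere from ten to five thousand dimensions.
Constraints. I have two annoying constraints:
1. The Hilbert transformation logic requires that points be expressed as uint (unsigned integer) arrays. (The code was written by someone else, is magic, uses shifts, ANDs, ORs and the like, and cannot be changed.) Storing my points as signed integers and constantly converting them to uint arrays gave wretched performance, so I must at least store a uint array copy of each point.
2. To improve efficiency, I made a signed integer copy of each point to speed up the distance calculations. This worked very well, but once I reach about 3,000 dimensions I run out of memory!
To save memory, I removed the memoized signed integer arrays and tried to write an unsigned version of the distance calculation. My best result is 2.25 times slower than the signed integer version.
The benchmark creates 1000 random points of 1000 dimensions each and computes the distance between each point and every other point, for 1,000,000 comparisons. Since I only care about relative distances, I save time by not taking the square root.
In debug mode: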
SignedBenchmark Ratio: 1.000 Seconds: 3.739
UnsignedBranchingBenchmark Ratio: 2.731 Seconds: 10.212
UnsignedDistributeBenchmark Ratio: 3.294 Seconds: 12.320
CastToSignedLongBenchmark Ratio: 3.265 Seconds: 12.211
In release mode:
SignedBenchmark Ratio: 1.000 Seconds: 3.494
UnsignedBranchingBenchmark Ratio: 2.672 Seconds: 9.334
UnsignedDistributeBenchmark Ratio: 3.336 Seconds: 11.657
CastToSignedLongBenchmark Ratio: 3.471 Seconds: 12.127
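The above benchmarks were run on a Dell with an Intel Core i7-4800MQ CPU @ 2.70GHz and 16 GB of memory. My larger algorithm already uses the Task Parallel Library for larger tasks, so parallelizing this inner loop would be fruitless.
Question: Can anyone think of a faster algorithm than UnsignedBranching?
Below is my benchmark code.
UPDATE
This uses loop unrolling (thanks to @dasblinkenlight) and is 2.7 times faster: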
public static long UnsignedLoopUnrolledBranching(uint[] x, uint[] y)
{
    var distance = 0UL;
    var leftovers = x.Length % 4;
    var dimensions = x.Length;
    var roundDimensions = dimensions - leftovers;
    // Main loop: process four coordinates per iteration.
    for (var i = 0; i < roundDimensions; i += 4)
    {
        var x1 = x[i];
        var y1 = y[i];
        var x2 = x[i+1];
        var y2 = y[i+1];
        var x3 = x[i+2];
        var y3 = y[i+2];
        var x4 = x[i+3];
        var y4 = y[i+3];
        var delta1 = x1 > y1 ? x1 - y1 : y1 - x1;
        var delta2 = x2 > y2 ? x2 - y2 : y2 - x2;
        var delta3 = x3 > y3 ? x3 - y3 : y3 - x3;
        var delta4 = x4 > y4 ? x4 - y4 : y4 - x4;
        // Caution: this sum is evaluated in 32-bit uint arithmetic and, in the default
        // unchecked context, wraps silently if it exceeds uint.MaxValue.
        distance += delta1 * delta1 + delta2 * delta2 + delta3 * delta3 + delta4 * delta4;
    }
    // Remainder loop: the last (dimensions % 4) coordinates.
    for (var i = roundDimensions; i < dimensions; i++)
    {
        var xi = x[i];
        var yi = y[i];
        var delta = xi > yi ? xi - yi : yi - xi;
        distance += delta * delta;
    }
    return (long)distance;
}
SquareDistance.cs:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace DistanceBenchmark
{
    /// <summary>
    /// Provide several alternate methods for computing the square of the Cartesian distance
    /// to allow study of their relative performance.
    /// </summary>
    public static class SquareDistance
    {
        /// <summary>
        /// Compute the square of the Cartesian distance between two N-dimensional points
        /// with calculations done on signed numbers using signed arithmetic,
        /// a single multiplication and no branching.
        /// </summary>
        /// <param name="x">First point.</param>
        /// <param name="y">Second point.</param>
        /// <returns>Square of the distance.</returns>
        public static long Signed(int[] x, int[] y)
        {
            var distance = 0L;
            var dimensions = x.Length;
            for (var i = 0; i < dimensions; i++)
            {
                var delta = x[i] - y[i];
                // Caution: delta * delta is 32-bit int arithmetic; in the default
                // unchecked context it wraps when |delta| > 46,340.
                distance += delta * delta;
            }
            return distance;
        }
        /// <summary>
        /// Compute the square of the Cartesian distance between two N-dimensional points
        /// with calculations done on unsigned numbers using unsigned arithmetic, a single multiplication
        /// and a branching instruction (the ternary operator).
        /// </summary>
        /// <param name="x">First point.</param>
        /// <param name="y">Second point.</param>
        /// <returns>Square of the distance.</returns>
        public static long UnsignedBranching(uint[] x, uint[] y)
        {
            var distance = 0UL;
            var dimensions = x.Length;
            for (var i = 0; i < dimensions; i++)
            {
                var xi = x[i];
                var yi = y[i];
                var delta = xi > yi ? xi - yi : yi - xi;
                // Caution: delta * delta is 32-bit uint arithmetic; in the default
                // unchecked context it wraps when delta > 65,535.
                distance += delta * delta;
            }
            return (long)distance;
        }
        /// <summary>
        /// Compute the square of the Cartesian distance between two N-dimensional points
        /// with calculations done on unsigned numbers using unsigned arithmetic and the distributive law,
        /// which requires four multiplications and no branching.
        ///
        /// To prevent overflow, the coordinates are cast to ulongs.
        /// </summary>
        /// <param name="x">First point.</param>
        /// <param name="y">Second point.</param>
        /// <returns>Square of the distance.</returns>
        public static long UnsignedDistribute(uint[] x, uint[] y)
        {
            var distance = 0UL;
            var dimensions = x.Length;
            for (var i = 0; i < dimensions; i++)
            {
                ulong xi = x[i];
                ulong yi = y[i];
                distance += xi * xi + yi * yi - 2 * xi * yi;
            }
            return (long)distance;
        }
        /// <summary>
        /// Compute the square of the Cartesian distance between two N-dimensional points
        /// with calculations done on unsigned numbers using signed arithmetic,
        /// by first casting the values into longs.
        /// </summary>
        /// <param name="x">First point.</param>
        /// <param name="y">Second point.</param>
        /// <returns>Square of the distance.</returns>
        public static long CastToSignedLong(uint[] x, uint[] y)
        {
            var distance = 0L;
            var dimensions = x.Length;
            for (var i = 0; i < dimensions; i++)
            {
                var delta = (long)x[i] - (long)y[i];
                distance += delta * delta;
            }
            return distance;
        }
    }
}
RandomPointFactory.cs:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace DistanceBenchmark
{
    public static class RandomPointFactory
    {
        /// <summary>
        /// Get a random list of signed integer points with the given number of dimensions to use as test data.
        /// </summary>
        /// <param name="recordCount">Number of points to get.</param>
        /// <param name="dimensions">Number of dimensions per point.</param>
        /// <returns>Signed integer test data.</returns>
        public static IList<int[]> GetSignedTestPoints(int recordCount, int dimensions)
        {
            var testData = new List<int[]>();
            var random = new Random(DateTime.Now.Millisecond);
            for (var iRecord = 0; iRecord < recordCount; iRecord++)
            {
                int[] point;
                testData.Add(point = new int[dimensions]);
                for (var d = 0; d < dimensions; d++)
                    point[d] = random.Next(100000);
            }
            return testData;
        }
        /// <summary>
        /// Get a random list of unsigned integer points with the given number of dimensions to use as test data.
        /// </summary>
        /// <param name="recordCount">Number of points to get.</param>
        /// <param name="dimensions">Number of dimensions per point.</param>
        /// <returns>Unsigned integer test data.</returns>
        public static IList<uint[]> GetUnsignedTestPoints(int recordCount, int dimensions)
        {
            var testData = new List<uint[]>();
            var random = new Random(DateTime.Now.Millisecond);
            for (var iRecord = 0; iRecord < recordCount; iRecord++)
            {
                uint[] point;
                testData.Add(point = new uint[dimensions]);
                for (var d = 0; d < dimensions; d++)
                    point[d] = (uint)random.Next(100000);
            }
            return testData;
        }
    }
}
Program.cs:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace DistanceBenchmark
{
    public class Program
    {
        private static IList<int[]> SignedTestData = RandomPointFactory.GetSignedTestPoints(1000, 1000);
        private static IList<uint[]> UnsignedTestData = RandomPointFactory.GetUnsignedTestPoints(1000, 1000);
        static void Main(string[] args)
        {
            var baseline = TimeIt("SignedBenchmark", SignedBenchmark);
            TimeIt("UnsignedBranchingBenchmark", UnsignedBranchingBenchmark, baseline);
            TimeIt("UnsignedDistributeBenchmark", UnsignedDistributeBenchmark, baseline);
            TimeIt("CastToSignedLongBenchmark", CastToSignedLongBenchmark, baseline);
            TimeIt("SignedBenchmark", SignedBenchmark, baseline);
            Console.WriteLine("Done. Type any key to exit.");
            Console.ReadLine();
        }
        public static void SignedBenchmark()
        {
            foreach(var p1 in SignedTestData)
                foreach (var p2 in SignedTestData)
                    SquareDistance.Signed(p1, p2);
        }
        public static void UnsignedBranchingBenchmark()
        {
            foreach (var p1 in UnsignedTestData)
                foreach (var p2 in UnsignedTestData)
                    SquareDistance.UnsignedBranching(p1, p2);
        }
        public static void UnsignedDistributeBenchmark()
        {
            foreach (var p1 in UnsignedTestData)
                foreach (var p2 in UnsignedTestData)
                    SquareDistance.UnsignedDistribute(p1, p2);
        }
        public static void CastToSignedLongBenchmark()
        {
            foreach (var p1 in UnsignedTestData)
                foreach (var p2 in UnsignedTestData)
                    SquareDistance.CastToSignedLong(p1, p2);
        }
        public static double TimeIt(String testName, Action benchmark, double baseline = 0.0)
        {
            var stopwatch = new Stopwatch();
            stopwatch.Start();
            benchmark();
            stopwatch.Stop();
            var seconds = stopwatch.Elapsed.TotalSeconds;
            var ratio = baseline <= 0 ? 1.0 : seconds/baseline;
            Console.WriteLine(String.Format("{0,-32} Ratio: {1:0.000} Seconds: {2:0.000}", testName, ratio, seconds));
            return seconds;
        }
    }
}
Answer 0 (score: 2)
You should be able to save a lot of execution time by unrolling your loops:
public static long Signed(int[] x, int[] y)
{
    var distance = 0L;
    var dimensions = x.Length;
    var stop = dimensions - (dimensions % 4);
    for (var i = 0; i < stop; i+=4)
    {
        var delta0 = x[i] - y[i];
        var delta1 = x[i+1] - y[i+1];
        var delta2 = x[i+2] - y[i+2];
        var delta3 = x[i+3] - y[i+3];
        distance += (delta0 * delta0)
                  + (delta1 * delta1)
                  + (delta2 * delta2)
                  + (delta3 * delta3);
    }
    for (var i = stop; i < dimensions; i++)
    {
        var delta = x[i] - y[i];
        distance += delta * delta;
    }
    return distance;
}
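This change alone reduced the execution time on my local system from 8.325 seconds to 4.745 seconds, a 43% improvement!
The idea is to process four coordinates at a time for as long as possible, then finish the remaining ones in a separate loop.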
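Answer 1 (score: 1)
If you can't change the Hilbert curve, you could try a z-curve, i.e. a Morton curve. Convert the dimensions to binary and interleave them. Then sort. You can verify the upper bound using the most significant bits. A Hilbert curve in n dimensions uses a Gray code, so you may be able to find a faster version by searching the internet. You can find some fast implementations in the hacker's cookbook. A Morton curve should be similar to an H-tree. When you need precision, you can try copies of the Hilbert curve, i.e. a Moore curve. In 2D, for example, you can interleave four Hilbert curves.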
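As a rough illustration of the "convert to binary and interleave" step, here is a minimal sketch of a Morton key. The class `MortonKey`, the method `Interleave` and the `bits` parameter are hypothetical names chosen for this example. It assumes the whole key fits in 64 bits, which holds only for a few low-resolution dimensions. Points with hundreds or thousands of dimensions would need a wider key type.

```csharp
using System;

public static class MortonKey
{
    // Interleave the low `bits` bits of each coordinate, most significant bit first:
    // the key starts with bit (bits - 1) of coords[0], coords[1], ..., then bit (bits - 2), and so on.
    public static ulong Interleave(uint[] coords, int bits)
    {
        if (coords.Length * bits > 64)
            throw new ArgumentException("Key would not fit in 64 bits.");
        var key = 0UL;
        for (var bit = bits - 1; bit >= 0; bit--)
            foreach (var c in coords)
                key = (key << 1) | ((c >> bit) & 1u);
        return key;
    }
}
```

Sorting points by this key orders them along the z-curve. For example, `MortonKey.Interleave(new uint[] { 1, 2 }, 2)` interleaves the bit strings 01 and 10 into 0110, which is 6.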
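Answer 2 (score: 0)
The best improvement I can see is not low-hanging fruit. This kind of problem is one that the current version of the .NET Framework (and CPUs in general) is not well suited to.
The class of problem you have is called SIMD. You may have heard of the Intel Pentium MMX. The MMX instruction set was a marketing term for a SIMD instruction set.
There are three good ways to get SIMD working with your program, in order from slowest to fastest and from easiest to hardest:
1. RyuJIT (a preview of the next .NET compiler), to take advantage of CPU SIMD
2. P/Invoke into native code
3. C++ AMP, to run on your GPU
I strongly recommend that you try leveraging the GPU with C++ AMP, especially since a uint[] should be easy to pass to C++ AMP.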
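To make the CPU SIMD option more concrete, here is a minimal sketch using `System.Numerics.Vector<T>`, the portable SIMD type that the JIT can map onto hardware vector instructions. It is not taken from the answer. The class name `SimdSquareDistance` is hypothetical, and whether it beats the scalar loops depends on the runtime and CPU, so it would need to be benchmarked. It avoids the branch by taking `Max - Min` per lane, then widens to 64 bits before squaring.

```csharp
using System;
using System.Numerics;

public static class SimdSquareDistance
{
    public static long Compute(uint[] x, uint[] y)
    {
        var width = Vector<uint>.Count;      // lanes per vector, hardware dependent
        var acc = Vector<ulong>.Zero;        // per-lane running sums of squares
        var i = 0;
        for (; i <= x.Length - width; i += width)
        {
            var vx = new Vector<uint>(x, i);
            var vy = new Vector<uint>(y, i);
            // |x - y| without branching: max - min never goes negative.
            var delta = Vector.Max(vx, vy) - Vector.Min(vx, vy);
            // Widen to 64-bit lanes so the squares cannot wrap at 32 bits.
            Vector.Widen(delta, out Vector<ulong> lo, out Vector<ulong> hi);
            acc += lo * lo + hi * hi;
        }
        ulong total = 0;
        for (var lane = 0; lane < Vector<ulong>.Count; lane++)
            total += acc[lane];
        // Scalar tail for the coordinates that do not fill a whole vector.
        for (; i < x.Length; i++)
        {
            ulong d = x[i] > y[i] ? x[i] - y[i] : y[i] - x[i];
            total += d * d;
        }
        return (long)total;
    }
}
```

`Vector.IsHardwareAccelerated` reports whether the JIT is actually using SIMD registers. When it returns false, the code still runs but falls back to a software implementation.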
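Answer 3 (score: 0)
In the shower this morning, I came up with a way to improve this further using the dot product, shaving off another fifty percent when the data is stored as uint[] arrays. I had investigated this idea before, but failed to recognize a loop invariant that I could optimize by precomputing. The basis of the idea is to distribute the operations:
(x-y)(x-y) = x*x + y*y - 2xy
If I sum this over all coordinates, the result is: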
$$D^2 = |x|^2 + |y|^2 - 2\,(x \cdot y)$$
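Since I will be performing many distance computations, I can store the squared length of each vector. Finding the distance between two vectors then amounts to adding their squared lengths (outside the loop) and computing the dot product, which has no negative values and therefore needs no branching!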
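One way to picture the precomputation is a small wrapper that caches the squared length alongside the coordinates. This is a minimal sketch only. The class `CachedPoint` and its members are hypothetical names. It widens every product to long for simplicity, so it does not include the cast-avoiding refinement discussed in this answer.

```csharp
using System;
using System.Linq;

// Hypothetical wrapper: caches the per-point value that does not depend on the other point.
public sealed class CachedPoint
{
    public uint[] Coordinates { get; }
    public long SquareMagnitude { get; }   // |x|^2, computed once per point

    public CachedPoint(uint[] coordinates)
    {
        Coordinates = coordinates;
        SquareMagnitude = coordinates.Sum(c => (long)c * c);
    }

    // |x|^2 + |y|^2 - 2(x·y): only the dot product is computed per pair.
    public long SquareDistanceTo(CachedPoint other)
    {
        var dot = 0L;
        var x = Coordinates;
        var y = other.Coordinates;
        for (var i = 0; i < x.Length; i++)
            dot += (long)x[i] * y[i];
        return SquareMagnitude + other.SquareMagnitude - 2L * dot;
    }
}
```

For the points (1, 2, 3) and (4, 6, 3), the cached squared lengths are 14 and 61 and the dot product is 25, giving 14 + 61 - 50 = 25, which matches 3² + 4² + 0².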
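Why is branching a problem? Because with uint vectors, you cannot subtract the values in the Cartesian formula without a branching operation to test which value is larger. So if I want (x-y)*(x-y), I need to do this inside the loop:
var delta = x[i] > y[i] ? x[i] - y[i] : y[i] - x[i];
distance += delta * delta;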
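A tiny sketch of what goes wrong without the branch (the class and method names here are hypothetical, for illustration only): in C#'s default unchecked context, subtracting a larger uint from a smaller one wraps around modulo 2^32 instead of going negative.

```csharp
using System;

public static class UnsignedDelta
{
    // Plain subtraction: wraps modulo 2^32 when x < y.
    public static uint Wrapping(uint x, uint y) => unchecked(x - y);

    // Branching form used in the loop: always the true absolute difference.
    public static uint Absolute(uint x, uint y) => x > y ? x - y : y - x;
}
```

`UnsignedDelta.Wrapping(3, 5)` yields 4294967294, while `UnsignedDelta.Absolute(3, 5)` yields 2. Squaring the wrapped value would give a meaningless distance.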
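In addition, to prevent the uint arithmetic from overflowing, I need to cast the numbers to ulong, which really kills performance. Since most of my coordinates are small, I was able to devise a test: I also store the maximum value of each vector. Because I unroll my loop four iterations at a time, if 4 * xMax * yMax does not overflow a uint, I can dispense with most of the casts. If the test fails, I fall back to the more expensive version, which does more casting.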
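The overflow test described above can be sketched on its own as follows. The class `OverflowGuard` and the method `FitsInUInt` are hypothetical names. This version does the check in ulong and divides rather than multiplies by the unroll factor, so the check itself cannot overflow.

```csharp
public static class OverflowGuard
{
    // True when `unroll` products, each at most xMax * yMax, can be summed in a uint
    // without wrapping, so the inner loop may skip widening to ulong.
    public static bool FitsInUInt(uint xMax, uint yMax, int unroll = 4)
    {
        return (ulong)xMax * yMax <= uint.MaxValue / (ulong)unroll;
    }
}
```

With maxima of 1,999 and 10,000 the product is 19,990,000, well under the limit of 1,073,741,823, so the cheap path is safe. With both maxima at 100,000 the product is 10,000,000,000, so the widening version is required.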
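In the end, I have several implementations: naive with casting, branching, distributed with casting and without the loop invariants removed, and dot product with less casting and with the invariants removed.
The naive approach performs a subtraction, a multiplication and an addition in each loop iteration. The dot product version, with the loop invariants removed, uses only a multiplication and an addition.
Here are the benchmarks: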
For 100000 iterations and 2000 dimensions.
Naive time = 2.505 sec.
Branch time = 0.628 sec.
Distributed time = 6.371 sec.
Dot Product time = 0.288 sec.
Improve vs Naive = 88.5%.
Improve vs Branch = 54.14%.
Here is the code as an NUnit test:
using System;
using System.Diagnostics;
using NUnit.Framework;
using System.Linq;
namespace HilbertTransformationTests
{
    [TestFixture]
    public class CartesianDistanceTests
    {
        [Test]
        public void SquareDistanceBenchmark()
        {
            var dims = 2000;
            var x = new uint[dims];
            var y = new uint[dims];
            var xMag2 = 0L;
            var yMag2 = 0L;
            for (var i = 0; i < dims; i++)
            {
                x[i] = (uint)i;
                xMag2 += x[i] * (long)x[i];
                y[i] = (uint)(10000 - i);
                yMag2 += y[i] * (long)y[i];
            }
            var xMax = (long)x.Max();
            var yMax = (long)y.Max();
            var repetitions = 100000;
            var naiveTime = Time(() => SquareDistanceNaive(x, y), repetitions);
            var distributeTime = Time(() => SquareDistanceDistributed(x, y), repetitions);
            var branchTime = Time(() => SquareDistanceBranching(x, y), repetitions);
            var dotProductTime = Time(() => SquareDistanceDotProduct(x, y, xMag2, yMag2, xMax, yMax), repetitions);
            Console.Write($@"
For {repetitions} iterations and {dims} dimensions.
Naive time = {naiveTime} sec.
Branch time = {branchTime} sec.
Distributed time = {distributeTime} sec.
Dot Product time = {dotProductTime} sec.
Improve vs Naive = {((int)(10000 * (naiveTime - dotProductTime) / naiveTime)) / 100.0}%.
Improve vs Branch = {((int)(10000 * (branchTime - dotProductTime) / branchTime)) / 100.0}%.
");
            Assert.Less(dotProductTime, branchTime, "Dot product time should have been less than branch time");
        }
        private static double Time(Action action, int repeatCount)
        {
            var timer = new Stopwatch();
            timer.Start();
            for (var j = 0; j < repeatCount; j++)
                action();
            timer.Stop();
            return timer.ElapsedMilliseconds / 1000.0;
        }
        private static long SquareDistanceNaive(uint[] x, uint[] y)
        {
            var squareDistance = 0L;
            for (var i = 0; i < x.Length; i++)
            {
                var delta = (long)x[i] - (long)y[i];
                squareDistance += delta * delta;
            }
            return squareDistance;
        }
        /// <summary>
        /// Compute the square distance, using ternary operators for branching to keep subtraction operations from going negative,
        /// which is inappropriate for unsigned numbers.
        /// </summary>
        /// <returns>Square of the distance.</returns>
        /// <param name="x">First point.</param>
        /// <param name="y">Second point.</param>
        private static long SquareDistanceBranching(uint[] x, uint[] y)
        {
            long squareDistanceLoopUnrolled;
            // Unroll the loop partially to improve speed. (2.7x improvement!)
            var distance = 0UL;
            var leftovers = x.Length % 4;
            var dimensions = x.Length;
            var roundDimensions = dimensions - leftovers;
            for (var i = 0; i < roundDimensions; i += 4)
            {
                var x1 = x[i];
                var y1 = y[i];
                var x2 = x[i + 1];
                var y2 = y[i + 1];
                var x3 = x[i + 2];
                var y3 = y[i + 2];
                var x4 = x[i + 3];
                var y4 = y[i + 3];
                var delta1 = x1 > y1 ? x1 - y1 : y1 - x1;
                var delta2 = x2 > y2 ? x2 - y2 : y2 - x2;
                var delta3 = x3 > y3 ? x3 - y3 : y3 - x3;
                var delta4 = x4 > y4 ? x4 - y4 : y4 - x4;
                distance += delta1 * delta1 + delta2 * delta2 + delta3 * delta3 + delta4 * delta4;
            }
            for (var i = roundDimensions; i < dimensions; i++)
            {
                var xi = x[i];
                var yi = y[i];
                var delta = xi > yi ? xi - yi : yi - xi;
                distance += delta * delta;
            }
            squareDistanceLoopUnrolled = (long)distance;
            return squareDistanceLoopUnrolled;
        }
        private static long SquareDistanceDistributed(uint[] x, uint[] y)
        {
            long squareDistanceLoopUnrolled;
            // Unroll the loop partially to improve speed. (2.7x improvement!)
            var distance = 0UL;
            var dSubtract = 0UL;
            var leftovers = x.Length % 4;
            var dimensions = x.Length;
            var roundDimensions = dimensions - leftovers;
            for (var i = 0; i < roundDimensions; i += 4)
            {
                ulong x1 = x[i];
                ulong y1 = y[i];
                ulong x2 = x[i + 1];
                ulong y2 = y[i + 1];
                ulong x3 = x[i + 2];
                ulong y3 = y[i + 2];
                ulong x4 = x[i + 3];
                ulong y4 = y[i + 3];
                distance += x1 * x1 + y1 * y1
                          + x2 * x2 + y2 * y2
                          + x3 * x3 + y3 * y3
                          + x4 * x4 + y4 * y4;
                dSubtract += x1 * y1 + x2 * y2 + x3 * y3 + x4 * y4;
            }
            distance = distance - 2UL * dSubtract;
            for (var i = roundDimensions; i < dimensions; i++)
            {
                var xi = x[i];
                var yi = y[i];
                var delta = xi > yi ? xi - yi : yi - xi;
                distance += delta * delta;
            }
            squareDistanceLoopUnrolled = (long)distance;
            return squareDistanceLoopUnrolled;
        }
        private static long SquareDistanceDotProduct(uint[] x, uint[] y, long xMag2, long yMag2, long xMax, long yMax)
        {
            const int unroll = 4;
            if (xMax * yMax * unroll < (long) uint.MaxValue)
                return SquareDistanceDotProductNoOverflow(x, y, xMag2, yMag2);
            // Unroll the loop partially to improve speed. (2.7x improvement!)
            var dotProduct = 0UL;
            var leftovers = x.Length % unroll;
            var dimensions = x.Length;
            var roundDimensions = dimensions - leftovers;
            for (var i = 0; i < roundDimensions; i += unroll)
            {
                var x1 = x[i];
                ulong y1 = y[i];
                var x2 = x[i + 1];
                ulong y2 = y[i + 1];
                var x3 = x[i + 2];
                ulong y3 = y[i + 2];
                var x4 = x[i + 3];
                ulong y4 = y[i + 3];
                dotProduct += x1 * y1 + x2 * y2 + x3 * y3 + x4 * y4;
            }
            for (var i = roundDimensions; i < dimensions; i++)
                dotProduct += x[i] * (ulong)y[i];
            return xMag2 + yMag2 - 2L * (long)dotProduct;
        }
        /// <summary>
        /// Compute the square of the Cartesian distance using the dot product method,
        /// assuming that calculations won't overflow uint.
        ///
        /// This permits us to skip some widening conversions to ulong, making the computation faster.
        ///
        /// Algorithm:
        ///
        ///    D^2 = |x|^2 + |y|^2 - 2(x·y)
        ///
        /// Using the dot product of x and y and precomputed values for the square magnitudes of x and y
        /// permits us to use two operations (multiply and add) instead of three (subtract, multiply and add)
        /// in the main loop, saving one third of the time.
        /// </summary>
        /// <returns>The square distance.</returns>
        /// <param name="x">First point.</param>
        /// <param name="y">Second point.</param>
        /// <param name="xMag2">Distance from x to the origin, squared.</param>
        /// <param name="yMag2">Distance from y to the origin, squared.</param>
        private static long SquareDistanceDotProductNoOverflow(uint[] x, uint[] y, long xMag2, long yMag2)
        {
            // Unroll the loop partially to improve speed. (2.7x improvement!)
            const int unroll = 4;
            var dotProduct = 0UL;
            var leftovers = x.Length % unroll;
            var dimensions = x.Length;
            var roundDimensions = dimensions - leftovers;
            for (var i = 0; i < roundDimensions; i += unroll)
                dotProduct += (x[i] * y[i] + x[i+1] * y[i+1] + x[i+2] * y[i+2] + x[i+3] * y[i+3]);
            for (var i = roundDimensions; i < dimensions; i++)
                dotProduct += x[i] * y[i];
            return xMag2 + yMag2 - 2L * (long)dotProduct;
        }
    }
}