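In C#, I want to round doubles to a lower precision so that I can store them in buckets of varying size in an associative array. Unlike the usual rounding, I want to round to a number of significant bits. Large numbers will therefore change more in absolute terms than small numbers, but all values will change by roughly the same proportion. So if I want to round to 10 binary digits, I find the ten most significant bits and zero out all the lower bits, possibly adding a small number first so that the value rounds up.
I prefer that "halfway" numbers be rounded up.
If it were an integer type, this would be a possible algorithm: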
1. Find: H, the zero-based index of the most significant binary digit set.
2. Compute: B = H - P, where P is the number of significant digits of precision to keep and B is the binary digit at which rounding starts (B = 0 is the ones place, B = 1 is the twos place, etc.).
3. Add: x = x + 2^B. This forces a carry if necessary (halfway values round up).
4. Zero out: x = x - (x mod 2^(B+1)). This clears the B place and all lower digits.
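The integer steps above can be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, System.Numerics.BitOperations.Log2 (available from .NET Core 3.0) stands in for "find the most significant bit", and the low digits are cleared with a mask. Values so close to ulong.MaxValue that the addition wraps around are not handled.

```csharp
using System;
using System.Numerics;

public static class IntegerRounding // hypothetical name, for illustration only
{
    // Rounds x to p significant binary digits, halfway values rounding up.
    public static ulong RoundToSignificantBits(ulong x, int p)
    {
        if (p < 1) throw new ArgumentOutOfRangeException(nameof(p));
        if (x == 0) return 0;

        int h = BitOperations.Log2(x);   // step 1: zero-based index of the highest set bit
        int b = h - p;                   // step 2: bit position where rounding starts
        if (b < 0) return x;             // x already fits in p digits

        x += 1UL << b;                   // step 3: add 2^B to force a carry on halfway values
        ulong lowMask = (1UL << (b + 1)) - 1;
        return x & ~lowMask;             // step 4: clear the B place and everything below it
    }
}
```

With p = 3 this reproduces small cases such as 9 becoming 10 and 15 becoming 16.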
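The problem is finding an efficient way to locate the highest bit set. If I were using integers, there are cool hacks for finding the MSB. I do not want to call Round(Log2(x)) if I can help it. This function will be called many millions of times.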
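For a double, the position of the highest set bit is already stored in the exponent field, so it can be read without any logarithm. A minimal sketch, assuming a normal (not zero, subnormal, infinite, or NaN) IEEE 754 double; the class and method names are hypothetical:

```csharp
using System;

public static class DoubleBits // hypothetical name, for illustration only
{
    // Returns e such that 2^e <= |d| < 2^(e+1), read directly from the exponent field.
    public static int BinaryExponent(double d)
    {
        long bits = BitConverter.DoubleToInt64Bits(d);
        int biased = (int)((bits >> 52) & 0x7FF);   // 11-bit biased exponent
        if (biased == 0 || biased == 0x7FF)
            throw new ArgumentException("Zero, subnormal, infinite, and NaN values are not handled.", nameof(d));
        return biased - 1023;                       // remove the IEEE 754 bias
    }
}
```

For example, 8.0 and 9.5 both give 3, and 0.75 gives -1.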
Note: I have read this question:
What is a good way to round double-precision values to a (somewhat) lower precision?
It works for C++. I am using C#.
UPDATE:
This is the code I am using (modified from what the answerer supplied):
/// <summary>
/// Round numbers to a specified number of significant binary digits.
///
/// For example, to 3 places, numbers from zero to seven are unchanged, because they only require 3 binary digits,
/// but larger numbers lose precision:
///
/// 8 1000 => 1000 8
/// 9 1001 => 1010 10
/// 10 1010 => 1010 10
/// 11 1011 => 1100 12
/// 12 1100 => 1100 12
/// 13 1101 => 1110 14
/// 14 1110 => 1110 14
/// 15 1111 =>10000 16
/// 16 10000 =>10000 16
///
/// This is different from rounding in that we are specifying the place where rounding occurs as the distance to the right
/// in binary digits from the highest bit set, not the distance to the left from the zero bit.
/// </summary>
/// <param name="d">Number to be rounded.</param>
/// <param name="digits">Number of binary digits of precision to preserve. </param>
public static double AdjustPrecision(this double d, int digits)
{
    // TODO: Not sure if this will work for both normalized and denormalized doubles. Needs more research.
    var shift = 53 - digits; // IEEE 754 doubles have 53 bits of significand, but one bit is "implied" and not stored.
    ulong significandMask = (0xffffffffffffffffUL >> shift) << shift;
    var local_d = d;
    unsafe
    {
        // double -> fixed point (sorta)
        ulong toLong = *(ulong*)(&local_d);
        // mask off your least-sig bits
        var modLong = toLong & significandMask;
        // fixed point -> float (sorta)
        local_d = *(double*)(&modLong);
    }
    return local_d;
}
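The masking code above truncates. One way to get the "halfway rounds up" behaviour described in the question is to add half of the last kept bit to the raw bit pattern before masking, so that a carry propagates into the exponent when needed. The following is a sketch of that idea, not the code from the question: the names are hypothetical, it uses BitConverter instead of pointers, it assumes normal finite doubles, and halfway values round away from zero (in magnitude).

```csharp
using System;

public static class SignificantBitRounding // hypothetical name, for illustration only
{
    public static double RoundToSignificantBits(double d, int digits)
    {
        if (digits < 1 || digits > 53)
            throw new ArgumentOutOfRangeException(nameof(digits));
        if (double.IsNaN(d) || double.IsInfinity(d) || digits == 53)
            return d;                        // nothing to round

        int shift = 53 - digits;             // number of low significand bits to drop
        long bits = BitConverter.DoubleToInt64Bits(d);
        bits += 1L << (shift - 1);           // add half of the last kept bit
        bits &= -1L << shift;                // clear the dropped bits
        return BitConverter.Int64BitsToDouble(bits);
    }
}
```

With digits = 3 this gives 10 for 9.0 and 16 for 15.0, matching the table in the doc comment, and -10 for -9.0.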
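UPDATE 2: Dekker's algorithm
I derived this from Dekker's algorithm, thanks to the other respondent. It rounds to the nearest value instead of truncating as the code above does, and it uses only safe code: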
private static double[] PowersOfTwoPlusOne;
static NumericalAlgorithms()
{
    PowersOfTwoPlusOne = new double[54];
    for (var i = 0; i < PowersOfTwoPlusOne.Length; i++)
    {
        if (i == 0)
            PowersOfTwoPlusOne[i] = 1; // Special case.
        else
        {
            long two_to_i_plus_one = (1L << i) + 1L;
            PowersOfTwoPlusOne[i] = (double)two_to_i_plus_one;
        }
    }
}
public static double AdjustPrecisionSafely(this double d, int digits)
{
    double t = d * PowersOfTwoPlusOne[53 - digits];
    double adjusted = t - (t - d);
    return adjusted;
}
UPDATE 2: Timings
I ran a test and found that Dekker's algorithm is more than TWICE as fast!
Number of calls in test: 100,000,000
Unsafe time = 1.922 (sec)
Safe time = 0.799 (sec)
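Answer 0: (score: 8)
Dekker's algorithm splits a floating-point number into high and low parts. If there are s bits in the significand (53 in IEEE 754 64-bit binary), then *x0 receives the high s - b bits, which is what you requested, and *x1 receives the remaining bits, which you may discard. In the code below, Scale should have the value 2^b. If b is known at compile time, e.g. the constant 43, you can replace Scale with 0x1p43. Otherwise, you must produce 2^b in some way.
This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic may work too. It rounds ties to even, which is not what you requested (ties upward). Is that necessary?
This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in double precision (not greater).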
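/* Scale must be defined elsewhere and equal 2^b, as described above. */
void Split(double *x0, double *x1, double x)
{
    double d = x * (Scale + 1);
    double t = d - x;
    *x0 = d - t;   /* high s - b bits of x */
    *x1 = x - *x0; /* remaining low bits */
}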
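For readers working in C#, the split above might be ported along these lines. This is a sketch under assumptions: the class, method, and tuple element names are hypothetical, 2^b is built at run time by writing the exponent field directly, and the runtime is assumed to evaluate double arithmetic in double precision as the answer requires.

```csharp
using System;

public static class DekkerSplit // hypothetical name, for illustration only
{
    // Splits x into a high part with (53 - b) significant bits and the low remainder.
    public static (double Hi, double Lo) Split(double x, int b)
    {
        if (b < 0 || b > 52) throw new ArgumentOutOfRangeException(nameof(b));

        // 2^b as a double: biased exponent 1023 + b, zero fraction.
        double scale = BitConverter.Int64BitsToDouble((long)(1023 + b) << 52);

        double d = x * (scale + 1);   // assumed not to overflow
        double t = d - x;
        double hi = d - t;            // x rounded to 53 - b significant bits
        double lo = x - hi;           // what was rounded away
        return (hi, lo);
    }
}
```

For example, Split(1000.0, 49) keeps 4 significant bits and yields Hi = 1024 and Lo = -24; in every case Hi + Lo reproduces x.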
Answer 1: (score: 2)
Interesting... I have never heard of a need for this, but I think you can "do it" with some funky unsafe code...
void Main()
{
    // how many bits you want "saved" (counted from the top of the 32-bit pattern,
    // so this includes the sign and exponent bits)
    var maxBits = 20;
    // create a mask like 0b1111...000 where # of 1's == maxBits
    var shift = (sizeof(int) * 8) - maxBits;
    var maxBitsMask = (0xffffffff >> shift) << shift;
    // some floats
    var floats = new []{ 1.04125f, 2.19412347f, 3.1415926f};
    foreach (var f in floats)
    {
        var localf = f;
        unsafe
        {
            // float -> raw bits (read as uint so the mask applies without widening to long)
            uint toInt = *(uint*)(&localf);
            // mask off your least-sig bits
            var modInt = toInt & maxBitsMask;
            // raw bits -> float
            localf = *(float*)(&modInt);
        }
        Console.WriteLine("Was {0}, now {1}", f, localf);
    }
}
And with doubles:
void Main()
{
    var maxBits = 50;
    var shift = (sizeof(long) * 8) - maxBits;
    var maxBitsMask = (0xffffffffffffffff >> shift) << shift;
    var doubles = new []{ 1412.04125, 22.19412347, 3.1415926};
    foreach (var d in doubles)
    {
        var local = d;
        unsafe
        {
            var toLong = *(ulong*)(&local);
            var modLong = toLong & maxBitsMask;
            local = *(double*)(&modLong);
        }
        Console.WriteLine("Was {0}, now {1}", d, local);
    }
}
Aww... I got unaccepted. :)
For completeness, here it is using Jeppe's "unsafe-free" approach:
void Main()
{
    var maxBits = 50;
    var shift = (sizeof(long) * 8) - maxBits;
    var maxBitsMask = (long)((0xffffffffffffffff >> shift) << shift);
    var doubles = new []{ 1412.04125, 22.19412347, 3.1415926};
    foreach (var d in doubles)
    {
        var local = d;
        var asLong = BitConverter.DoubleToInt64Bits(d);
        var modLong = asLong & maxBitsMask;
        local = BitConverter.Int64BitsToDouble(modLong);
        Console.WriteLine("Was {0}, now {1}", d, local);
    }
}