Question

在Python中，假设我有连续变量x和y，它们的值限制在0和1之间（以使其更容易）。我一直以为，如果我想将这些变量转换为具有0,0.01,0.02，...，0.98,0.99,1的bin的序数值，则可以将原始值四舍五入到第二位。出于某种原因，当我这样做时，它会留下工件。

让我说明问题（但是请注意，我的问题不是如何获取正确的绘图，而是实际上如何进行正确的装箱）。首先，这些是重现问题的唯一模块：

import numpy as np
import matplotlib.pyplot as plt

现在，假设我们连续产生了如下数据（其他数据生成过程也会产生相同的问题）：

# number of points drawn from Gaussian dists.:
n = 100000
x = np.random.normal(0, 2, n)
y = np.random.normal(4, 5, n)

# normalizing x and y to bound them between 0 and 1
# (it's way easier to illustrate the problem this way)
x = (x - min(x))/(max(x) - min(x))
y = (y - min(y))/(max(y) - min(y))

然后，仅通过应用四舍五入，就可以在上述间隔内将x和y转换为序数。然后，让我们将结果存储到x x y矩阵中，以绘制其热图出于说明目的：

# matrix that will represent the bins. Notice that the
# desired bins are every 0.01, from 0 to 1, so 100 bins:
mtx = np.zeros([100,100])
for i in range(n):
    # my idea was that I could roughly get the bins by
    # simply rounding to the 2nd decimal point:
    posX = round(x[i], 2)
    posY = round(y[i], 2)
    mtx[int(posX*100)-1, int(posY*100)-1] += 1

我希望上面的方法能起作用，但是当我绘制矩阵mtx的内容时，实际上得到了奇怪的伪像。代码：

# notice, however, the weird close-to-empty lines at
# 0.30 and 0.59 of both x and y. This happens regardless
# of how I generate x and y. Regardless of distributions
# or of number of points (even if it obviously becomes
# impossible to see if there are too few points):
plt.matshow(mtx, cmap=plt.cm.jet)
plt.show(block=False)

给我：

最奇怪的是，无论我使用哪种分布来生成x和y或使用哪种种子作为RNG，我总是在0.30处获得相同的水平和垂直近空线x和y的0.59，通常这些线与显示点集中的线立即平行（如您在图像中看到的那样）。

当我从那个矩阵到控制台将值逐个打印时，我实际上可以确认与那些接近空的行相对应的那些确实是零或非常接近零-与它们的相邻点不同。

我的问题可以更恰当地分为两部分：

为什么会发生以上情况？我真的很想了解在简单的代码中究竟是什么导致了这样的问题。
通过x矩阵生成y的更好方法是什么，该矩阵根据切点0、0.01、0.02，...对值进行分箱... ，0.98,0.99,1而没有在上面留下伪像？

如果要轻松地直接将上面使用的整个示例代码完整地抓取，请点击以下链接： https://www.codepile.net/pile/VLAq4kLp

注意：我不想找到正确的绘图方法。我想找到我自己来生成上面表示的“合并值矩阵”的正确方法。我知道，还有其他方法可以在没有伪影的情况下完成热图绘制，例如使用plt.matshow(mtx, cmap=plt.cm.jet); plt.show(block=False)或plt.hist2d(x, y, bins=100)。我要问的是矩阵生成本身中的问题在哪里，它创建了那些接近零的元素。

Answer 1

使用np.histogram2d(x,y, bins=100)可以轻松解决问题。

此答案的其余部分是显示手动算法失败的地方：

考虑一下数字

0.56*100 == 56.00000000000001    -> int(0.56*100) == 56
0.57*100 == 56.99999999999999    -> int(0.57*100) == 56
0.58*100 == 57.99999999999999    -> int(0.58*100) == 57
0.59*100 == 59.00000000000000    -> int(0.59*100) == 59

，这样数字58根本不会出现在索引中，而数字56的出现频率则是两倍（用于均匀分布）。

您可以改为先相乘，然后截断为整数。还要注意，最后一个垃圾箱需要关闭，这样会将值1添加到索引为99的垃圾箱中。

mtx = np.zeros([100,100])
for i in range(n):
    posX = int(x[i]*100)
    posY = int(y[i]*100)
    if posX == 100:
        posX = 99
    if posY == 100:
        posY = 99
    mtx[posX, posY] += 1

这将通过边缘定义垃圾箱，即第一个垃圾箱的范围是0到1等。在调用imshow / matshow时，您需要通过设置范围来考虑这一点。

plt.matshow(mtx, cmap=plt.cm.jet, extent=(0,100,0,100))

Answer 2

您的方法存在的问题是浮点错误。当您尝试将舍入的数字转换为整数时，这一点变得显而易见。考虑以下函数（实际上是您对每个随机数所做的事情）：

def int_round(a):
     r = round(a, 2)
     rh = r*100
     i = int(rh)
     print(r, rh, i)


int_round(0.27)
#prints: 0.27 27.0 27

int_round(0.28)
#prints: 0.28 28.000000000000004 28

int_round(0.29)
#prints: 0.29 28.999999999999996 28

int_round(0.30)
#prints: 0.3 30.0 30

如您所见，由于舍入0.28和0.29并乘以100后的浮点误差，0.28和0.29都以28的整数结尾。（这是因为int()总是四舍五入，因此28.99999999999变为28。）

一个解决方案可能是将值乘以100后四舍五入：

def round_int(a):
    ah = a*100
    rh = round(ah, 2)
    i = int(rh)
    print(ah, rh, i)

round_int(0.27)
#prints: 27.0 27.0 27

round_int(0.28)
#prints: 28.000000000000004 28.0 28

round_int(0.29)
#prints: 28.999999999999996 29.0 29

round_int(0.30)
#prints: 30.0 30.0 30

请注意，在这种情况下，0.29被更正为29。

将此逻辑应用于您的代码：我们可以将for循环更改为：

mtx = np.zeros([101, 101])

for i in range(n):
    # my idea was that I could roughly get the bins by
    # simply rounding to the 2nd decimal point:
    posX = np.round(100*x[i], 2)
    posY = np.round(100*y[i], 2)
    mtx[int(posX), int(posY)] += 1

请注意，当x = 1或y = 1时，将增加的bin数量增加到101，以说明最终的bin。此外，在这里您可以看到，在四舍五入之前，我们将x[i]和y[i]乘以100，可以正确进行合并：

Answer 3

到目前为止，由于我仍在第一部分中寻找错误，因此我只能正确回答您的第二个问题。

这是您可以根据需要为二进制文件选择的标准解决方案（假设您之前提到的x和y）

h = plt.hist2d(x, y, bins=100)

给予

这是一个100x100的网格。

变量h现在包含所需的矩阵以及matplotlib找到的bin。 plt.matshow(h[0])显示与图中相同的矩阵，该矩阵由matplotlib返回。如评论中所述：通过调用

可以获得相同的结果（但没有自动绘图）

h = np.histogram2d(x, y, bins=100)

尽管如此，您的算法还是不正确的，因为您实际上是在计算边缘上的项目数量，而不是计算它们之间的数量，因此您获得了101个项目每个方向。例如，当posX==0时，您会看到问题：然后int(posX*100)-1产生-1。

Answer 4

我不知道如何准确回答您的第一个问题。但是对于合并项目，我也使用pandas.cut。对于您的解决方案，您可以做到

import pandas as pd
bins = [v / 100. for v in range(100)
bucketed = pd.cut(x, bins)

bucketed将指示每个数据点属于哪个间隔

作为参考，这是一个不错的教程http://benalexkeen.com/bucketing-continuous-variables-in-pandas/

使用round（）合并连续值会产生伪像

4 个答案: