Question

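I'm developing a new standard histogram class for Python (ideally as a contribution to numpy, since I ran into a number of serious shortcomings with the standard implementation while doing density estimation during my Master's).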

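I've designed a solution that I think is much neater and more general for organising variable-size bins. Unfortunately, the central job of a histogram, summing the number of records in each bin, is proving difficult to do reliably. I've provided a simplified version below.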

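The example code below successfully counts the points in each bin. You can run it as is, and it also produces a plot of the points and the bin outlines: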

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches

class Custom_Histogram():

    def __init__(self, df, bins=None): 

        self.DF = df 
        self.bins = np.array(bins)
        self.n_dims = len(df.columns.values)

        ## Count number of datapoints in each bin
        self.hist = np.array([  
                    np.sum(     (self.DF.iloc[:,:] >= self.bins[:,:,0][i][:]) & 
                                (self.DF.iloc[:,:] <  self.bins[:,:,1][i][:])
                       )
                       for i in range(len(self.bins)) ], dtype=np.int32)[:,0]


## Generate Random Data
N = 200
X = np.random.normal(0.5,0.15,N)
Y = np.random.normal(0.5,0.05,N)
## Populate a Pandas DataFrame
DF = pd.DataFrame({'x':X,'y':Y})

## Hardcoded, contiguous, 2D variable-area bins
bins = np.array([
            [[0.0,0.2],[0.0,1.5]],
            [[0.2,0.4],[0.0,1.5]],
            [[0.4,0.6],[0.0,1.5]],
            [[0.6,0.8],[0.0,1.5]],
            [[0.8,1.0],[0.0,1.5]]
            ])

## Generate histogram using custom bins
Hist = Custom_Histogram(DF, bins)
print('Histogram: ', Hist.hist)

## 2D Plot
fig, axes = plt.subplots(figsize=(4, 3.5))

plt.scatter(DF.iloc[:,0], DF.iloc[:,1], 5, 'k')

# Create a patch for each bin and plot
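for b in bins:  # 'b' avoids shadowing the built-in bin()
    rect = patches.Rectangle((b[0][0], b[1][0]),   # lower-left corner (x, y)
                             b[0][1] - b[0][0],     # width along x
                             b[1][1] - b[1][0],     # height along y
                             linewidth=1,
                             edgecolor='r', facecolor='none')
    axes.add_patch(rect)

# Axis limits only need to be set once, after all patches are added
axes.set_ylim(-0.5, 2)
axes.set_xlim(-0.5, 1.5)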

print('Histogram Sum: ', np.sum(Hist.hist))
print('Data Points: ', N)

plt.show()

With more complex bins, the counts are no longer correct, and some points appear to be counted twice. Try the following bins:

bins = np.array([
            [[0.0,0.2],[0.0,1.5]],
            [[0.2,0.4],[0.0,1.5]],
            [[0.4,0.6],[0.0,1.0]],
            [[0.4,0.6],[1.0,1.5]],
            [[0.6,0.8],[0.0,0.5]],
            [[0.6,0.8],[0.5,1.5]],
            [[0.8,1.0],[0.0,1.5]]
            ])

These bins plot as follows (figure: complex bins plot), and the histogram returns more points than actually exist.
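One possible explanation for the over-count, offered as a sketch rather than a confirmed diagnosis: if the sum over the boolean comparison is taken column by column and only the first column is kept (which is what the trailing `[:,0]` suggests), then each bin is counted on its x range alone and the y range is ignored. Two bins that share an x range would then both claim the same points. The example below reproduces that effect with plain NumPy arrays and three hand-picked points; the names `points`, `x_only_counts` and `joint_counts` are illustrative and do not appear in the original code.

```python
import numpy as np

# Three hand-picked points, all with x inside [0.4, 0.6)
points = np.array([[0.5, 0.5],
                   [0.5, 0.7],
                   [0.5, 1.2]])

# The two bins from the complex layout that share the same x range
bins = np.array([
    [[0.4, 0.6], [0.0, 1.0]],
    [[0.4, 0.6], [1.0, 1.5]],
])

x_only_counts = []   # per-column sum, keeping only the x column
joint_counts = []    # a point counts only if it is inside in every dimension
for b in bins:
    lower, upper = b[:, 0], b[:, 1]
    inside = (points >= lower) & (points < upper)        # shape (3, 2)
    x_only_counts.append(int(inside.sum(axis=0)[0]))     # ignores the y condition
    joint_counts.append(int(inside.all(axis=1).sum()))   # requires x AND y

print('x-only counts:', x_only_counts, 'total:', sum(x_only_counts))
print('joint counts: ', joint_counts, 'total:', sum(joint_counts))
```

With the x-only count, both bins report all three points, so the total is six for a data set of three. Requiring the condition to hold across every column before summing gives a total that matches the data.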

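So I'd like to know how to change the logic that counts the data points while keeping it efficient and vectorised (I'm fairly sure no loops should be needed here). It also needs to keep working for any number of dimensions, which already seems to be nearly there: the slicing is structured to apply the conditions in each dimension correctly, but something else is happening that causes the double counting described above.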
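A minimal sketch of one loop-free approach, under the assumption that the data fit in memory as an `(N, D)` array and the bins as a `(B, D, 2)` array of half-open `[lower, upper)` intervals. The function name `count_points_in_bins` and the sample points are hypothetical and chosen for illustration only. The idea is to broadcast the points against the bin edges, require the condition to hold in every dimension, and only then sum over the points.

```python
import numpy as np

def count_points_in_bins(points, bins):
    """Count points per bin (hypothetical helper, not part of numpy).

    points: array of shape (N, D)
    bins:   array of shape (B, D, 2), each entry [lower, upper) per dimension
    Returns an integer array of shape (B,).
    """
    points = np.asarray(points, dtype=float)
    bins = np.asarray(bins, dtype=float)
    lower = bins[:, :, 0]   # shape (B, D)
    upper = bins[:, :, 1]   # shape (B, D)
    # Broadcast (1, N, D) against (B, 1, D) to get a (B, N, D) boolean array
    inside = (points[None, :, :] >= lower[:, None, :]) & \
             (points[None, :, :] < upper[:, None, :])
    # A point belongs to a bin only if it is inside along every dimension
    return inside.all(axis=2).sum(axis=1)

# The complex bins from the question
bins = np.array([
    [[0.0, 0.2], [0.0, 1.5]],
    [[0.2, 0.4], [0.0, 1.5]],
    [[0.4, 0.6], [0.0, 1.0]],
    [[0.4, 0.6], [1.0, 1.5]],
    [[0.6, 0.8], [0.0, 0.5]],
    [[0.6, 0.8], [0.5, 1.5]],
    [[0.8, 1.0], [0.0, 1.5]],
])

# Hand-picked points; the last one lies outside every bin
points = np.array([[0.1, 0.5], [0.5, 0.5], [0.5, 1.2], [0.7, 0.2],
                   [0.7, 0.9], [0.9, 1.0], [1.2, 0.5]])

counts = count_points_in_bins(points, bins)
print('Histogram:    ', counts)
print('Histogram Sum:', counts.sum())
```

For the DataFrame in the question, `DF.to_numpy()` could be passed as `points`. Because the bins do not overlap, each point is counted at most once, and points outside every bin are simply not counted, so the histogram sum can be smaller than `N` but never larger. The intermediate boolean array has `B * N * D` elements, so very large inputs may need to be processed in chunks.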

How to construct a Python histogram using vectorised conditional counting
