Question

是否有python的图形库，它不需要将所有原始数据点存储为 numpy 数组或列表以绘制直方图？

我的数据集内存太大，我不想使用子采样来减少数据大小。

我正在寻找的是一个可以获取生成器输出的库（每个数据点来自一个文件，作为 float ），并构建一个直方图。

这包括计算bin大小，因为生成器会从文件中生成每个数据点。

如果此类库不存在，我想知道numpy是否能够从产生的数据点预先计算{bin_1:count_1, bin_2:count_2...bin_x:count_x}的计数器

数据点作为垂直矩阵保存在一个标签文件中，由node-node-score排列如下：

node   node   5.55555

更多信息：

104301133数据行（目前为止）
我不知道最小值或最大值
bin宽度应该相同
箱数可以是1000

尝试答案：

low = np.inf
high = -np.inf

# find the overall min/max
chunksize = 1000
loop = 0
for chunk in pd.read_table('gsl_test_1.txt', header=None, chunksize=chunksize, delimiter='\t'):
    low = np.minimum(chunk.iloc[:, 2].min(), low)
    high = np.maximum(chunk.iloc[:, 2].max(), high)
    loop += 1
lines = loop*chunksize

nbins = math.ceil(math.sqrt(lines))   

bin_edges = np.linspace(low, high, nbins + 1)
total = np.zeros(nbins, np.int64)  # np.ndarray filled with np.uint32 zeros, CHANGED TO int64


# iterate over your dataset in chunks of 1000 lines (increase or decrease this
# according to how much you can hold in memory)
for chunk in pd.read_table('gsl_test_1.txt', header=None, chunksize=2, delimiter='\t'):

    # compute bin counts over the 3rd column
    subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges)  # np.ndarray filled with np.int64

    # accumulate bin counts over chunks
    total += subtotal


plt.hist(bin_edges[:-1], bins=bin_edges, weights=total)
# plt.bar(np.arange(total.shape[0]), total, width=1)
plt.savefig('gsl_test_hist.svg')

输出：

Answer 1

您可以迭代数据集的块并使用np.histogram将bin计数累积到单个向量中（您需要先验地定义bin边缘并使用{将它们传递给np.histogram {1}}参数），例如：

bins=

如果你想确保你的容器跨越数组中的所有值，但你还不知道最小值和最大值，那么你需要事先循环一次以计算它们（例如使用{{ 1}} / import numpy as np import pandas as pd bin_edges = np.linspace(low, high, nbins + 1) total = np.zeros(nbins, np.uint) # iterate over your dataset in chunks of 1000 lines (increase or decrease this # according to how much you can hold in memory) for chunk in pd.read_table('/path/to/my/dataset.txt', header=None, chunksize=1000): # compute bin counts over the 3rd column subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges) # accumulate bin counts over chunks total += subtotal.astype(np.uint)），例如：

np.min

一旦获得了bin计数数组，就可以使用plt.bar直接生成条形图：

np.max

也可以将low = np.inf high = -np.inf # find the overall min/max for chunk in pd.read_table('/path/to/my/dataset.txt', header=None, chunksize=1000): low = np.minimum(chunk.iloc[:, 2].min(), low) high = np.maximum(chunk.iloc[:, 2].max(), high)参数用于plt.hist，以便从计数向量而不是样本生成直方图，例如：

plt.bar(bin_edges[:-1], total, width=1)

如何从文件过大的内存中构建（或预先计算）直方图？

1 个答案: