Question

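So I have a large amount of data being binned, and it seems a bit... slow?

I made a minimal example that mimics the number of data points and the number of bins computed on a smaller subset: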

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time

np.random.seed(1)


n_samples = 37000
n_bins    = 91000

data = pd.Series(np.random.gamma(1, 1, n_samples))

t1 = time.time()
binned_df = pd.cut(data, bins = n_bins, precision = 100).value_counts()
t2 = time.time()
print("pd.cut speed: {}".format(t2-t1))


summed = np.sum(binned_df)
print("sum: {:.4f}".format(summed))
print("len: {}".format(len(binned_df)))
print(binned_df.head())

plt.hist(data, bins = 100)
plt.show()

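If I set the precision of pd.cut() to 100, the script takes about 1.5 seconds on my computer, and I get very precise bins, e.g. (0.209274211931, 0.209375434515]. If I set the precision to 1, however, the same operation takes about 9.2 seconds, so it is much slower, and the bins are now only defined as e.g. (0.2093, 0.2094].

But why does a higher precision compute faster? Am I misunderstanding what is happening here?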

Answer 1

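Looking at the source code, it appears that giving pandas a precision above 19 lets you skip a loop that would otherwise run (provided your dtype is not datetime64 or timedelta64; see Line 326). The relevant code starts on Line 393 and goes to Line 415. The comments with a double hash (##) are mine: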

## This function figures out how far to round the bins after decimal place
def _round_frac(x, precision):
    """
    Round the fractional part of the given number
    """
    if not np.isfinite(x) or x == 0:
        return x
    else:
        frac, whole = np.modf(x)
        if whole == 0:
            digits = -int(np.floor(np.log10(abs(frac)))) - 1 + precision
        else:
            digits = precision
        return np.around(x, digits)

## This function loops through and makes the cuts more and more precise
## sequentially and only stops if either the number of unique levels created
## by the precision are equal to the number of bins or, if that doesn't
## work, just returns the precision you gave it. 

## However, range(100, 20) cannot loop so you jump to the end
def _infer_precision(base_precision, bins):
    """Infer an appropriate precision for _round_frac
    """
    for precision in range(base_precision, 20):
        levels = [_round_frac(b, precision) for b in bins]
        if algos.unique(levels).size == bins.size:
            return precision
    return base_precision # default
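To make the skipped loop concrete, here is a minimal standalone sketch of the two functions above. It assumes np.unique as a stand-in for pandas' internal algos.unique, and the call counter and the evenly spaced edges array are hypothetical additions for illustration only. It counts how many per-edge rounding calls happen for a low and a high starting precision:

```python
import numpy as np

def round_frac(x, precision):
    """Standalone copy of _round_frac from the excerpt above."""
    if not np.isfinite(x) or x == 0:
        return x
    frac, whole = np.modf(x)
    if whole == 0:
        digits = -int(np.floor(np.log10(abs(frac)))) - 1 + precision
    else:
        digits = precision
    return np.around(x, digits)

def infer_precision(base_precision, bins):
    """Same loop as _infer_precision, but also returns the number of
    round_frac calls made (the counter is an illustrative addition)."""
    calls = 0
    for precision in range(base_precision, 20):
        levels = [round_frac(b, precision) for b in bins]
        calls += len(bins)
        # np.unique stands in for pandas' internal algos.unique here.
        if np.unique(levels).size == bins.size:
            return precision, calls
    return base_precision, calls

# Hypothetical bin edges, chosen only to exercise the loop.
edges = np.linspace(0.0, 10.0, 1001)

low_precision, low_calls = infer_precision(1, edges)
high_precision, high_calls = infer_precision(100, edges)

print("precision=1   ->", low_precision, "after", low_calls, "rounding calls")
print("precision=100 ->", high_precision, "after", high_calls, "rounding calls")
```

With a starting precision of 100, range(100, 20) is empty, so no edge is ever rounded inside the loop. With a starting precision of 1, every pass rounds each edge one at a time in Python, which is presumably where the extra time goes when there are 91000 bins.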

Edit: a controlled example

Suppose you have a list, test, with six elements that you want to split into three bins:

test = [1.121, 1.123, 1.131, 1.133, 1.141, 1.143]

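Obviously you would want the splits to fall after 1.123 and 1.133, but say you do not give pandas the bins directly and instead give it the number of bins (n_bins = 3). Pretend pandas starts its guess by cutting the data evenly into 3 (note: I don't know whether this is how pandas chooses its initial cuts; this is just for the sake of the example):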

# To calculate where the bin cuts start
x = (1.143 - 1.121)/3
cut1 = 1.121 + x  # 1.1283
cut2 = 1.121 + (2*x) # ~1.1357
bins = [cut1, cut2]
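As an aside, the edges pandas actually chooses can be inspected rather than guessed, using the retbins=True option of pd.cut. A minimal sketch, assuming a reasonably recent pandas version (the variable names are only for illustration):

```python
import pandas as pd

test = [1.121, 1.123, 1.131, 1.133, 1.141, 1.143]

# retbins=True makes pd.cut return the bin edges alongside the categories.
# A large precision is passed so the reported labels are not rounded coarsely.
categories, edges = pd.cut(test, bins=3, precision=10, retbins=True)

print(edges)       # 4 edges delimit 3 bins
print(categories)  # the interval each element falls into
```

The two interior edges can then be compared with cut1 and cut2 above.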

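But on top of that, suppose you tell pandas to use a precision of 1. Applying this precision to the cuts above gives you 1.1 for both, which is useless for separating test because every entry also looks like 1.1. So the package has to go through and use more and more decimal places on the estimated cut values until the resulting number of levels matches n_bins: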

# Adapted from infer_precision
for precision in range(1, 4):     
    levels = [_round_frac(b, precision) for b in bins]
    print(levels)

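This process only stops when the number of unique levels matches the number of bins, or when it reaches the 20-decimal-place limit of the loop. Supplying a precision of 100 skips that search entirely and lets the package use up to 100 places after the decimal point to place its cut values between ever more precise values in your data.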
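The controlled example can be run end to end as the self-contained sketch below. It assumes the same evenly spaced toy cuts as above (not necessarily what pandas computes) and uses np.unique in place of pandas' internal algos.unique; the names cuts, unique_counts and inferred are only for illustration:

```python
import numpy as np

def round_frac(x, precision):
    """Standalone copy of _round_frac from the answer."""
    if not np.isfinite(x) or x == 0:
        return x
    frac, whole = np.modf(x)
    if whole == 0:
        digits = -int(np.floor(np.log10(abs(frac)))) - 1 + precision
    else:
        digits = precision
    return np.around(x, digits)

test = [1.121, 1.123, 1.131, 1.133, 1.141, 1.143]

# Toy cuts: split the data range evenly into three parts.
step = (max(test) - min(test)) / 3
cuts = np.array([min(test) + step, min(test) + 2 * step])

# Count how many distinct cut values survive at each precision.
unique_counts = {}
for precision in range(1, 4):
    levels = [round_frac(c, precision) for c in cuts]
    unique_counts[precision] = int(np.unique(levels).size)
    print(precision, levels, unique_counts[precision])

# First precision at which every cut is distinguishable from the others.
inferred = next(p for p, n in unique_counts.items() if n == cuts.size)
print("inferred precision:", inferred)
```

At precision 1 both cuts collapse to 1.1, so only one distinct level remains; at precision 2 they round to 1.13 and 1.14, which is the first point at which the number of distinct levels equals the number of cuts.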
