Question

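I have a dataframe of event data, one column of which is the time interval in which each event occurred. I want to use pd.qcut() to build percentiles for each time interval from the events in that interval, and assign each event its respective percentile.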

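import pandas as pd

def event_quartiler(event_row):
    # All events that share this event's time interval
    in_interval = events.loc[events['TimeInterval'] == event_row['TimeInterval']]
    # Split those events' timestamps into 100 quantile bins
    quartiles = pd.qcut(in_interval['DateTime'], 100)
    # Walk the bins and return the 1-based position of the one containing this event
    counter = 1
    for quartile in quartiles.unique():
        if event_row['DateTime'] in quartile:
            return counter
        counter = counter + 1
        if counter > 100:
            break
    return -1

events['Quartile'] = events.apply(event_quartiler, axis=1)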

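I expected this to simply set the Quartile column to each event's respective percentile, but instead the code takes a long time to run and then effectively bugs out with the following output: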

ValueError: ("Bin edges must be unique: array([1.55016605e+18, 1.55016616e+18, 1.55016627e+18, 1.55016632e+18,\n       1.55016632e+18, 1.55016636e+18,
... (I put the ellipsis here because there are 100 data points) 
1.55017534e+18, 1.55017545e+18,\n       1.55017555e+18]).\nYou can drop duplicate edges by setting the 'duplicates' kwarg", 'occurred at index 6539')

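There is nothing unusual about the data at index 6539 or about any of the events in its interval, but I also can't find what is wrong with my code.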

Answer 1

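I figured out the issue: qcut tries to fit the data points themselves into quantiles, whereas cut takes the minimum and maximum and splits that range into n bins. Because in this case I was trying to create more quantiles than there were actual data points, qcut failed: some of the computed bin edges coincided, which is what the "Bin edges must be unique" message reports.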
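The difference between the two functions can be seen on a small example. This is only a minimal sketch: the toy values below are hypothetical and show one situation in which qcut's edges collide (repeated values), which may or may not be exactly what happened in the original data.

```python
import pandas as pd

# Hypothetical toy data: few distinct values, several repeats.
values = pd.Series([1, 1, 1, 1, 2, 3])

# qcut derives its bin edges from the data's own quantiles. With repeated
# values, several quantiles are equal, so the edges are not unique.
try:
    pd.qcut(values, 4)
    qcut_failed = False
    message = ""
except ValueError as exc:
    qcut_failed = True
    message = str(exc)

# cut only looks at the minimum and maximum and splits that range into
# equal-width bins, so repeated values do not matter.
equal_width = pd.cut(values, 4)
n_bins = len(equal_width.cat.categories)

print(qcut_failed, n_bins)
```

As the error message itself suggests, pd.qcut also accepts duplicates='drop', which merges the coinciding edges; the result then has fewer bins than requested.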

Simply using cut with 100 bins solved my problem, and I was able to produce the percentiles.
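One possible way to apply this per time interval without a row-by-row apply is sketched below. It is not the answerer's actual code: the sample data, the helper name equal_width_bin, and the intermediate Offset column are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
events = pd.DataFrame({
    "TimeInterval": [0, 0, 0, 0, 1, 1, 1],
    "DateTime": pd.to_datetime([
        "2019-02-14 00:00:00", "2019-02-14 00:10:00",
        "2019-02-14 00:10:00", "2019-02-14 01:00:00",
        "2019-02-14 02:00:00", "2019-02-14 02:30:00",
        "2019-02-14 03:00:00",
    ]),
})

# Work on plain numbers: seconds elapsed since the earliest event.
# (Offset is a helper column introduced here, not part of the original data.)
events["Offset"] = (events["DateTime"] - events["DateTime"].min()).dt.total_seconds()

def equal_width_bin(offsets, n_bins=100):
    # labels=False makes pd.cut return the integer bin index (0-based);
    # adding 1 gives bin numbers 1..n_bins.
    return pd.cut(offsets, n_bins, labels=False) + 1

# Bin each interval's events separately, once per group instead of once per row.
events["Quartile"] = events.groupby("TimeInterval")["Offset"].transform(equal_width_bin)

print(events[["TimeInterval", "DateTime", "Quartile"]])
```

Note that bins produced this way are equal-width in time rather than equal in event count, so the earliest event in each interval lands in bin 1 and the latest in bin 100, and events with identical timestamps share a bin.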
