Question

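I am trying to bin a Pandas DataFrame in Python 3 so that grouping on a large dataset is more efficient. Currently, the performance bottleneck is iterating over the DataFrame with the .apply() method.

All entries in the column are hexadecimal strings, so it seems like the pd.to_numeric function should do exactly what I want.

I have tried several options, but so far nothing has worked.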

#  This sets all values to np.nan with coerced errors, 'Unable to parse string' with raise errors.
dataframe[bin] = pd.to_numeric(dataframe[to_bin], errors='coerce') % __NUM_BINS__ 

# Gives me "int() Cannot convert non-string with explicit base"
dataframe[bin] = int(dataframe[to_bin].astype(str), 16) % __NUM_BINS__

# Value Error: Invalid literal for int with base 10 'ffffffffff'
dataframe[bin] = dataframe.astype(np.int64) % __NUM_BINS__

Any suggestions? This seems like a problem people must have solved before.

Answer 1

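After some help from the comments above: a faster approach is to use a generator function. That way, it can handle any exception raised when the supplied data cannot be converted from hexadecimal.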

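def bin_vals(lst):
    # Yield the bin for each hex string; unparseable items go to the error bin.
    for item in lst:
        try:
            yield int(item, 16) % __NUM_BINS__
        except (TypeError, ValueError):
            yield __ERROR_BIN__  # whatever you store weird items in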

Then, in your conversion step, you would do the following:

dataframe['binned_value'] = list(bin_vals(dataframe['val_to_bin'].tolist()))

This gave a large speedup compared with iterating over each row. It was also faster than the apply method I was originally using.
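To make the behavior concrete, here is a minimal sketch of the same generator run on a small hand-made list. The values of NUM_BINS and ERROR_BIN and the sample strings are assumptions chosen for illustration, not values from the original data; in practice the list would come from something like dataframe['val_to_bin'].tolist().

```python
NUM_BINS = 16    # assumed bin count, for illustration only
ERROR_BIN = -1   # assumed bin for values that cannot be parsed as hex

def bin_vals(values):
    """Yield a bin index for each hex string, or ERROR_BIN on failure."""
    for item in values:
        try:
            yield int(item, 16) % NUM_BINS
        except (TypeError, ValueError):
            # ValueError: string is not valid hex; TypeError: item is not a string
            yield ERROR_BIN

# Hypothetical sample column contents, including two bad entries
sample = ["ff", "10", "ffffffffff", "not hex", None]
binned = list(bin_vals(sample))
print(binned)
```

Valid hex strings are binned by their integer value modulo NUM_BINS, while the malformed string and the missing value both fall into ERROR_BIN instead of raising.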

Converting a hex column in Pandas without iterating
