Question

def func(row):
    if row.GT_x == row.GT_y or row.GT_x == row.GT_y[::-1]:
        return 2
    elif len(set(row.GT_x) & set(row.GT_y)) != 0:
        return 1
    else:
        return 0

%%timeit
merged_df['Decision'] = merged_df.apply(func, axis=1)

1 loop, best of 3: 30.2 s per loop

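I'm going to apply "func" to every row of the dataframe, which has about 650,000 rows.

I guess pandas.apply() takes more time than iterating with a for loop.

I also tried a lambda function instead of "func", but the result was the same.

My dataframe has two columns named GT_x and GT_y, holding values such as "AA" or "BB". The function "func" returns 2 if GT_x and GT_y are the same (in either character order), 1 if they share only one character, and 0 otherwise.

I want to create another column (Decision) by applying "func".

Can you recommend a faster way?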

Here is my sample data:

       GT_x  GT_y
0      AG    GA
1      AA    GA
2      AA    GG
3      GG    GG
...
65000  GG    GG

The result for index 0 should be 2, for index 1 it should be 1, for index 2 it should be 0, and for indexes 3 and 65000 it should also be 2.
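A minimal sketch to check these expected values, assuming the four sample rows above are loaded into a small frame. The name sample_df is hypothetical and only stands in for the first rows of the real data.

```python
import pandas as pd

# Hypothetical frame holding only the four sample rows shown above.
sample_df = pd.DataFrame({"GT_x": ["AG", "AA", "AA", "GG"],
                          "GT_y": ["GA", "GA", "GG", "GG"]})

def func(row):
    # 2: the two values are identical, also when the characters are swapped.
    if row.GT_x == row.GT_y or row.GT_x == row.GT_y[::-1]:
        return 2
    # 1: the two values share at least one character.
    elif len(set(row.GT_x) & set(row.GT_y)) != 0:
        return 1
    # 0: no character in common.
    else:
        return 0

decisions = sample_df.apply(func, axis=1).tolist()
print(decisions)  # [2, 1, 0, 2]
```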

Answer 1

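You can use df.apply(func, axis=1, raw=True) to speed up the computation (in this case the input to your function will be a raw numpy array instead of a Series). Because an ndarray has no column attributes, the function then has to read the two values by position, for example row[0] and row[1], instead of row.GT_x and row.GT_y.

From the description of the apply function: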

raw : boolean, default False
If False, convert each row or column into a Series. If raw=True the
passed function will receive ndarray objects instead. If you are just
applying a NumPy reduction function this will achieve much better
performance

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
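A minimal sketch of what this could look like for the function in the question. The name func_raw is hypothetical, and the small merged_df built here is only a stand-in containing the sample rows. It assumes the two columns are selected explicitly so that position 0 is GT_x and position 1 is GT_y. How much time this saves on 650,000 rows is not measured here.

```python
import pandas as pd

# Hypothetical stand-in for merged_df, built from the sample rows in the question.
merged_df = pd.DataFrame({"GT_x": ["AG", "AA", "AA", "GG"],
                          "GT_y": ["GA", "GA", "GG", "GG"]})

def func_raw(row):
    # With raw=True, row is a NumPy array, so values are read by position.
    gt_x, gt_y = row[0], row[1]
    if gt_x == gt_y or gt_x == gt_y[::-1]:
        return 2
    elif set(gt_x) & set(gt_y):
        return 1
    return 0

# Select the two columns explicitly so positions 0 and 1 are known.
merged_df["Decision"] = merged_df[["GT_x", "GT_y"]].apply(func_raw, axis=1, raw=True)
print(merged_df)
```

Selecting the two columns first also avoids passing any other columns of the frame into the raw array.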

How to speed up pandas dataframe.apply() for large data
