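I have a DataFrame in which one column holds some data and another column holds the indices I want to delete from that data. So, starting from this: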
import pandas as pd
import numpy as np
df = pd.DataFrame({'data':[np.arange(1,5),np.arange(3)],'to_delete': [np.array([2]),np.array([0,2])]})
df
>>>>    data         to_delete
        [1,2,3,4]    [2]
        [0,1,2]      [0,2]
this is what I want to end up with:
new_df
>>>>    data         to_delete
        [1,2,4]      [2]
        [1]          [0,2]
I can iterate over the rows manually and compute the new data for each one like this:
new_data = []
for _, v in df.iterrows():
    foo = np.delete(v['data'], v['to_delete'])
    new_data.append(foo)
df.assign(data=new_data)
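But I am looking for a better way.
Answer 0 (score: 2)
The overhead of calling a NumPy function for every row really hurts performance here. I suggest using a list comprehension instead: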
df['data'] = [[j for ix, j in enumerate(i[0]) if ix not in i[1]]
              for i in df.values]
print(df)
        data to_delete
0  [1, 2, 4]       [2]
1        [1]    [0, 2]
Timings on a 20K-row DataFrame:
df_large = pd.concat([df]*10000, axis=0)
%timeit [[j for ix, j in enumerate(i[0]) if ix not in i[1]]
         for i in df_large.values]
# 184 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
new_data = []
for _, v in df_large.iterrows():
    foo = np.delete(v['data'], v['to_delete'])
    new_data.append(foo)
# 5.44 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_large.apply(lambda row: np.delete(row["data"],
                                             row["to_delete"]), axis=1)
# 5.29 s ± 340 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
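As a rough sketch of a variant of the list comprehension above: it zips the two columns directly instead of going through df.values, and converts each to_delete array to a set so the membership test does not scan the array for every element. The helper name drop_positions is made up for illustration, and I have not timed this version.

```python
import numpy as np
import pandas as pd


def drop_positions(values, positions):
    """Return the items of `values` whose position is not in `positions`."""
    # .tolist() turns NumPy integers into plain Python ints before hashing.
    skip = set(np.asarray(positions).tolist())
    return [v for ix, v in enumerate(values) if ix not in skip]


df = pd.DataFrame({'data': [np.arange(1, 5), np.arange(3)],
                   'to_delete': [np.array([2]), np.array([0, 2])]})

# Pair each row's data with its indices without materialising df.values.
new_data = [drop_positions(d, t) for d, t in zip(df['data'], df['to_delete'])]
result = df.assign(data=new_data)
print(result)
```

Note that, unlike np.delete, this treats the indices literally: a negative index is not interpreted as counting from the end, and the rows come back as Python lists rather than arrays.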
Answer 1 (score: 1)
You should use the apply function, which applies a function to each row of the DataFrame:
df["data"] = df.apply(lambda row: np.delete(row["data"], row["to_delete"]), axis=1)
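For anyone who wants to convince themselves that apply with axis=1 does the same thing as the explicit loop from the question, here is a small self-contained check on the sample data. The variable names via_apply and via_loop are just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'data': [np.arange(1, 5), np.arange(3)],
                   'to_delete': [np.array([2]), np.array([0, 2])]})

# With axis=1 the lambda receives one row at a time as a Series.
via_apply = df.apply(lambda row: np.delete(row["data"], row["to_delete"]),
                     axis=1)

# The same computation written as a plain loop over the two columns.
via_loop = [np.delete(d, t) for d, t in zip(df["data"], df["to_delete"])]

same = all(np.array_equal(a, b) for a, b in zip(via_apply, via_loop))
print(same)
```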
Answer 2 (score: 0)
This solution relies on starmap, a lesser-known tool in the itertools module.
Check out its documentation, it is worth a try!
import pandas as pd
import numpy as np
from itertools import starmap
df = pd.DataFrame({'data': [np.arange(1,5),np.arange(3)],
                   'to_delete': [np.array([2]),np.array([0,2])]})
# Solution:
df2 = df.copy()
A = list(starmap(lambda v,l: np.delete(v,l),
                 zip(df['data'],df['to_delete'])))
df2['data'] = pd.DataFrame(zip(A))
df2
This prints:
        data to_delete
0  [1, 2, 4]       [2]
1        [1]    [0, 2]
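A slightly shorter take on the same idea, offered as a sketch only: since starmap already unpacks each (data, to_delete) pair into positional arguments, np.delete can be passed directly without the lambda, and the resulting list can be handed to assign the way the question does. The name trimmed is just illustrative.

```python
from itertools import starmap

import numpy as np
import pandas as pd

df = pd.DataFrame({'data': [np.arange(1, 5), np.arange(3)],
                   'to_delete': [np.array([2]), np.array([0, 2])]})

# starmap calls np.delete(data, to_delete) once per zipped pair.
trimmed = list(starmap(np.delete, zip(df['data'], df['to_delete'])))

# assign returns a new DataFrame and leaves df untouched.
df2 = df.assign(data=trimmed)
print(df2)
```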