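I am trying to subtract a column in one dataframe from the matching column in another dataframe, and I want to do this for n columns (the dataframes I am currently working with have 1000 columns).
This is what the two dataframes look like:
Dataframe 1: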
            | branch A (pkg XYZ) | branch A (pkg ABC) | branch B (pkg XYZ)
---------------------------------------------------------------------------
5/21 - 5/27 | 20                 | 30                 | 50
5/28 - 6/02 | 10                 | 30                 | 50
6/03 - 6/09 | 30                 | 40                 | 50
6/10 - 6/16 | 20                 | 30                 | 50
6/17 - 6/23 | 50                 | 10                 | 50
Dataframe 2:
            | branch A (pkg XYZ) | branch A (pkg ABC) | branch B (pkg XYZ)
---------------------------------------------------------------------------
5/21 - 5/27 | 3                  | 5                  | 50
5/28 - 6/02 | 2                  | 6                  | 50
6/03 - 6/09 | 3                  | 7                  | 50
6/10 - 6/16 | 1                  | 2                  | 50
6/17 - 6/23 | 4                  | 0                  | 50
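If I want to subtract every column of Dataframe 2 from Dataframe 1 wherever the column headers match (for example "branch A (pkg XYZ)"), what is the most efficient way to do it?
I tried iterating over a list of one dataframe's columns and subtracting each column from the other dataframe as shown here, but this seems very inefficient.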
i = 0
df1_cols = list(df1)
while i < len(df1.columns):
    col_name = df1_cols[i]
    # df3 is an empty dataframe
    df3[col_name] = df1[col_name] - df2[col_name]
    i += 1
Answer 0 (score: 0)
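You can use the subtract method (DataFrame.subtract, or Series.subtract when called on a single column as below) to subtract columns across the two dataframes. We iterate over the columns in df2 and, if a column is also found in df1, perform the subtraction on that column. Finally, we save the result in a separate column whose name ends with "Result". Note that the loop computes df2 - df1, which is why the results are negative; swap the operands to get df1 - df2 as in the question.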
In [1]: import pandas as pd
In [2]: df1 = pd.DataFrame({"branch A(pkg XYZ)":[20,10,30,20,50], "branch A(pkg ABC)":[30,30,40,30,10], "branch B(pkg XYZ)":[50, 50, 50, 50, 50]})
In [3]: df1
Out[3]:
   branch A(pkg XYZ)  branch A(pkg ABC)  branch B(pkg XYZ)
0                 20                 30                 50
1                 10                 30                 50
2                 30                 40                 50
3                 20                 30                 50
4                 50                 10                 50
In [4]: df2 = pd.DataFrame({"branch A(pkg XYZ)":[3,2,3,1,4], "branch A(pkg ABC)":[5,6,7,2,0], "branch B(pkg XYZ)":[50,50,50,50,50]})
In [5]: df2
Out[5]:
   branch A(pkg XYZ)  branch A(pkg ABC)  branch B(pkg XYZ)
0                  3                  5                 50
1                  2                  6                 50
2                  3                  7                 50
3                  1                  2                 50
4                  4                  0                 50
In [25]: for i in df2.columns:
    ...:     if i in df1.columns:
    ...:         df2[i+"Result"] = df2[i].subtract(df1[i], fill_value=0)
In [29]: df2
Out[29]:
   branch A(pkg XYZ)  branch A(pkg ABC)  branch B(pkg XYZ)  \
0                  3                  5                 50
1                  2                  6                 50
2                  3                  7                 50
3                  1                  2                 50
4                  4                  0                 50

   branch A(pkg XYZ)Result  branch A(pkg ABC)Result  branch B(pkg XYZ)Result
0                      -17                      -25                        0
1                       -8                      -24                        0
2                      -27                      -33                        0
3                      -19                      -28                        0
4                      -46                      -10                        0
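The explicit loop is not strictly needed, because pandas aligns two dataframes on their row index and column labels when subtracting. The following is a minimal sketch of that idea, not the answerer's method: it rebuilds the two small frames from above and computes df1 - df2 (the direction asked for in the question) in one call. The names `common` and `diff` are chosen here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({
    "branch A(pkg XYZ)": [20, 10, 30, 20, 50],
    "branch A(pkg ABC)": [30, 30, 40, 30, 10],
    "branch B(pkg XYZ)": [50, 50, 50, 50, 50],
})
df2 = pd.DataFrame({
    "branch A(pkg XYZ)": [3, 2, 3, 1, 4],
    "branch A(pkg ABC)": [5, 6, 7, 2, 0],
    "branch B(pkg XYZ)": [50, 50, 50, 50, 50],
})

# Keep only the headers that appear in both frames (illustrative name).
common = df1.columns.intersection(df2.columns)

# One vectorized subtraction; pandas matches columns by label, not position.
diff = df1[common].sub(df2[common])
print(diff)
```

Restricting both frames to `common` first avoids all-NaN columns for headers that exist in only one frame. If every header matches, `df1 - df2` alone should give the same result.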
The same loop is also reasonably fast with 1000 columns and 100 rows (about 367 ms per run in the timing below):
In [40]: import numpy as np
In [41]: df1 = pd.DataFrame(np.random.random((100, 1000)))
In [42]: df2 = pd.DataFrame(np.random.random((100, 1000)))
In [45]: %%timeit
    ...: for i in df2.columns:
    ...:     if i in df1.columns:
    ...:         df2[str(i)+"Result"] = df2[i].subtract(df1[i], fill_value=0)
    ...:
    ...:
367 ms ± 97.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [49]: df2.head(5)
Out[49]:
          0         1         2         3         4         5         6  \
0  0.327470  0.272503  0.549897  0.119997  0.985847  0.445402  0.582878
1  0.752375  0.606053  0.223085  0.062001  0.025440  0.638872  0.188112
2  0.174401  0.944870  0.630128  0.715326  0.298661  0.285740  0.360253
3  0.095649  0.355365  0.523830  0.114555  0.342535  0.393107  0.246344
4  0.250579  0.105054  0.761075  0.574047  0.733976  0.199406  0.658025

          7         8         9  ...  990Result  991Result  992Result  \
0  0.335388  0.613710  0.104878  ...  -0.728738   0.147162  -0.841872
1  0.796243  0.709898  0.133040  ...  -0.151361  -0.400989   0.012670
2  0.009304  0.472587  0.108229  ...  -0.131590  -0.540945  -0.097455
3  0.798668  0.628953  0.701703  ...  -0.461036   0.217387  -0.363704
4  0.387475  0.152143  0.825989  ...  -0.021844   0.103296  -0.272207

   993Result  994Result  995Result  996Result  997Result  998Result  999Result
0   0.389068   0.470042   0.556146   0.705036  -0.021659   0.250586  -0.662487
1  -0.456462  -0.206587   0.691951  -0.507585  -0.430838  -0.126303  -0.001411
2  -0.018339   0.226750   0.483076  -0.581611  -0.362906   0.796857  -0.367914
3   0.323971  -0.779884  -0.306404  -0.825982  -0.065974  -0.109321  -0.023654
4   0.178328   0.600110   0.222539   0.064416  -0.110039  -0.615137  -0.261765

[5 rows x 2000 columns]
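For a wide frame like this one, the per-column loop can likely be replaced by a single whole-frame subtraction, with all "Result" columns attached in one step. The sketch below assumes the same shapes as the timing example and keeps the loop's df2 - df1 direction; the seeded generator and the names `result` and `combined` are illustrative choices, not part of the original answer. Adding many columns one at a time may also trigger a fragmentation PerformanceWarning in recent pandas versions, which a single concat avoids.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible (an assumption, not in the original).
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.random((100, 1000)))
df2 = pd.DataFrame(rng.random((100, 1000)))

# Subtract every matching column at once (df2 - df1, as in the timed loop).
result = df2.sub(df1, fill_value=0)

# Rename the result columns so they end in "Result", e.g. 5 -> "5Result".
result.columns = [f"{c}Result" for c in result.columns]

# Attach all 1000 new columns in one concat instead of 1000 separate inserts.
combined = pd.concat([df2, result], axis=1)
print(combined.shape)
```

Timing this against the loop with `%timeit` on your own data is the safest way to confirm the gain, since the difference depends on the pandas version and the frame shape.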