从另一数据框列减去一个数据框列以获取多个列

时间:2019-06-21 19:06:24

标签: python dataframe

我正在尝试从一个数据框的另一列的数据框中减去一列,我想对n个列进行此操作(目前我正在处理的数据框有1000列)。

这是两个数据框的外观:

数据框1:

               branch A (pkg XYZ) | branch A (pkg ABC)| branch B (pkg XYZ)
               -------------------------------------------------------------
5/21 - 5/27           20          |        30         |        50
5/28 - 6/02           10          |        30         |        50
6/03 - 6/09           30          |        40         |        50
6/10 - 6/16           20          |        30         |        50
6/17 - 6/23           50          |        10         |        50

数据框2:

                branch A (pkg XYZ)| branch A (pkg ABC) | branch B (pkg XYZ)
                -----------------------------------------------------------
5/21 - 5/27           3           |        5           |        50
5/28 - 6/02           2           |        6           |        50
6/03 - 6/09           3           |        7           |        50
6/10 - 6/16           1           |        2           |        50
6/17 - 6/23           4           |        0           |        50

如果我想从 Dataframe 1 中减去 Dataframe 2 的所有列,则取决于它们的列标题(“ 分支A(pkg XYZ)< / em>“)匹配,最有效的方法是什么?

我尝试遍历一个数据帧列的列表,然后以这种方式从另一数据帧中减去它,但这似乎效率很低。

  i = 0
  df1_cols = list(df1)
  while i < len(df1.columns):
       col_name = df1_cols[i]

       # df3 is an empty dataframe
       df3[col_name] = df1[col_name] - df2[col_name]

       i += 1

1 个答案:

答案 0 :(得分:0)

您可以使用Dataframe.subtract减去两个数据框中的列。我们遍历df2中的列,如果在df1中找到该列,则在该列中执行减法。最后,我们将结果保存在名称以“ Result”结尾的单独列中。

In [1]: import pandas as pd

In [2]: df1 = pd.DataFrame({"branch A(pkg XYZ)":[20,10,30,20,50], "branch A(pkg ABC)":[30,30,40,30,10], "branch B(pkg X
   ...: YZ)":[50, 50, 50, 50, 50]})

In [3]: df1
Out[3]:
   branch A(pkg XYZ)  branch A(pkg ABC)  branch B(pkg XYZ)
0                 20                 30                 50
1                 10                 30                 50
2                 30                 40                 50
3                 20                 30                 50
4                 50                 10                 50

In [4]: df2 = pd.DataFrame({"branch A(pkg XYZ)":[3,2,3,1,4], "branch A(pkg ABC)":[5,6,7,2,0], "branch B(pkg XYZ)":[50,5
   ...: 0,50,50,50]})

In [5]: df2
Out[5]:
   branch A(pkg XYZ)  branch A(pkg ABC)  branch B(pkg XYZ)
0                  3                  5                 50
1                  2                  6                 50
2                  3                  7                 50
3                  1                  2                 50
4                  4                  0                 50

In [25]: for i in df2.columns:
    ...:     if i in df1.columns:
    ...:         df2[i+"Result"] = df2[i].subtract(df1[i], fill_value=0)

In [29]: df2
Out[29]:
   branch A(pkg XYZ)  branch A(pkg ABC)  branch B(pkg XYZ)  \
0                  3                  5                 50
1                  2                  6                 50
2                  3                  7                 50
3                  1                  2                 50
4                  4                  0                 50

   branch A(pkg XYZ)Result  branch A(pkg ABC)Result  branch B(pkg XYZ)Result
0                      -17                      -25                        0
1                       -8                      -24                        0
2                      -27                      -33                        0
3                      -19                      -28                        0
4                      -46                      -10                        0

尝试1000列和100行也很有效:

In [40]: import numpy as np
In [41]: df1 = pd.DataFrame(np.random.random((100, 1000)))
In [42]: df2 = pd.DataFrame(np.random.random((100, 1000)))
In [45]: %%timeit
    ...: for i in df2.columns:
    ...:     if i in df1.columns:
    ...:         df2[str(i)+"Result"] = df2[i].subtract(df1[i], fill_value=0)
    ...:
    ...:
367 ms ± 97.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [49]: df2.head(5)
Out[49]:
          0         1         2         3         4         5         6  \
0  0.327470  0.272503  0.549897  0.119997  0.985847  0.445402  0.582878
1  0.752375  0.606053  0.223085  0.062001  0.025440  0.638872  0.188112
2  0.174401  0.944870  0.630128  0.715326  0.298661  0.285740  0.360253
3  0.095649  0.355365  0.523830  0.114555  0.342535  0.393107  0.246344
4  0.250579  0.105054  0.761075  0.574047  0.733976  0.199406  0.658025

          7         8         9  ...  990Result  991Result  992Result  \
0  0.335388  0.613710  0.104878  ...  -0.728738   0.147162  -0.841872
1  0.796243  0.709898  0.133040  ...  -0.151361  -0.400989   0.012670
2  0.009304  0.472587  0.108229  ...  -0.131590  -0.540945  -0.097455
3  0.798668  0.628953  0.701703  ...  -0.461036   0.217387  -0.363704
4  0.387475  0.152143  0.825989  ...  -0.021844   0.103296  -0.272207

   993Result  994Result  995Result  996Result  997Result  998Result  999Result
0   0.389068   0.470042   0.556146   0.705036  -0.021659   0.250586  -0.662487
1  -0.456462  -0.206587   0.691951  -0.507585  -0.430838  -0.126303  -0.001411
2  -0.018339   0.226750   0.483076  -0.581611  -0.362906   0.796857  -0.367914
3   0.323971  -0.779884  -0.306404  -0.825982  -0.065974  -0.109321  -0.023654
4   0.178328   0.600110   0.222539   0.064416  -0.110039  -0.615137  -0.261765

[5 rows x 2000 columns]