Pandas compare 2 DataFrames by specific rows in all columns

Time: 2016-06-06 15:19:53

Tags: python python-2.7 pandas dataframe

I have the following Pandas DataFrame of some raw data:

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 10000)

col_raw_headers = ['07_08_19 #1','07_08_19 #2','07_08_19 #2.1','11_31_19 #1','11_31_19 #1.1','11_31_19 #1.3','12_15_20 #1','12_15_20 #2','12_15_20 #2.1','12_15_20 #2.2']
col_raw_trial_info = ['Quantity1','Quantity2','Quantity3','Quantity4','Quantity5','Quantity6','TimeStamp',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
cols_raw = [[1,75,9,7,-4,0.4,'07/08/2019 05:11'],[1,11,20,-17,12,0.8,'07/08/2019 10:54'],[2,0.9,17,102,56,0.6,'07/08/2019 21:04'],[1,70,4,75,0.8,0.4,'11/31/2019 11:15'],[2,60,74,41,-36,0.3,'11/31/2019 16:50'],[3,17,12,-89,30,0.1,'11/31/2019 21:33'],[1,6,34,496,-84,0.5,'12/15/2020 01:36'],[1,3,43,12,-23,0.5,'12/15/2020 07:01'],[2,5,92,17,64,0.5,'12/15/2020 11:15'],[3,7,11,62,-11,0.5,'12/15/2020 21:45']]
both_values = [[1,2,3,4,8,4,3,8,7],[6,5,3,7,3,23,27,3,11],[65,3,6,78,9,2,45,6,7],[4,3,6,8,3,5,66,32,84],[2,3,11,55,3,7,33,65,34],[22,1,6,32,5,6,4,3,898],[1,6,3,2,6,55,22,6,23],[34,37,46,918,0,37,91,12,68],[51,20,1,34,12,59,78,6,101],[12,71,34,94,1,73,46,51,21]]
processed_cols = ['c_1trial','14_1','14_2','8_1','8_2','8_3','28_1','24_1','24_2','24_3']

df_raw = pd.DataFrame(zip(*cols_raw))
df_temp = pd.DataFrame(zip(*both_values))
df_raw = pd.concat([df_raw,df_temp])
df_raw.columns=col_raw_headers
df_raw.insert(0,'Tr_id',col_raw_trial_info)
df_raw.reset_index(drop=True,inplace=True)

It looks like this:

        Tr_id       07_08_19 #1       07_08_19 #2     07_08_19 #2.1       11_31_19 #1     11_31_19 #1.1     11_31_19 #1.3       12_15_20 #1       12_15_20 #2     12_15_20 #2.1     12_15_20 #2.2
0   Quantity1                 1                 1                 2                 1                 2                 3                 1                 1                 2                 3
1   Quantity2                75                11               0.9                70                60                17                 6                 3                 5                 7
2   Quantity3                 9                20                17                 4                74                12                34                43                92                11
3   Quantity4                 7               -17               102                75                41               -89               496                12                17                62
4   Quantity5                -4                12                56               0.8               -36                30               -84               -23                64               -11
5   Quantity6               0.4               0.8               0.6               0.4               0.3               0.1               0.5               0.5               0.5               0.5
6   TimeStamp  07/08/2019 05:11  07/08/2019 10:54  07/08/2019 21:04  11/31/2019 11:15  11/31/2019 16:50  11/31/2019 21:33  12/15/2020 01:36  12/15/2020 07:01  12/15/2020 11:15  12/15/2020 21:45
7         NaN                 1                 6                65                 4                 2                22                 1                34                51                12
8         NaN                 2                 5                 3                 3                 3                 1                 6                37                20                71
9         NaN                 3                 3                 6                 6                11                 6                 3                46                 1                34
10        NaN                 4                 7                78                 8                55                32                 2               918                34                94
11        NaN                 8                 3                 9                 3                 3                 5                 6                 0                12                 1
12        NaN                 4                23                 2                 5                 7                 6                55                37                59                73
13        NaN                 3                27                45                66                33                 4                22                91                78                46
14        NaN                 8                 3                 6                32                65                 3                 6                12                 6                51
15        NaN                 7                11                 7                84                34               898                23                68               101                21

I have a separate DataFrame containing a processed version of these numbers, in which:

  1. The header rows at the top have been removed
  2. The column names have been changed

Here is the second DataFrame:

df_processed = pd.DataFrame(zip(*both_values),columns=processed_cols)
# Reorder the columns by position
df_processed = df_processed.iloc[:, [3,4,9,7,0,2,1,6,8,5]]

   8_1  8_2  24_3  24_1  c_1trial  14_2  14_1  28_1  24_2  8_3
0    4    2    12    34         1    65     6     1    51   22
1    3    3    71    37         2     3     5     6    20    1
2    6   11    34    46         3     6     3     3     1    6
3    8   55    94   918         4    78     7     2    34   32
4    3    3     1     0         8     9     3     6    12    5
5    5    7    73    37         4     2    23    55    59    6
6   66   33    46    91         3    45    27    22    78    4
7   32   65    51    12         8     6     3     6     6    3
8   84   34    21    68         7     7    11    23   101  898

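What the two DataFrames have in common:

For every column, the raw DataFrame from row 8 onward is identical to the processed DataFrame from row 1 onward. The column order differs between the two DataFrames.

Output sought:

I want to compare rows 8-16, in columns 1-10, of the raw DataFrame df_raw with the processed DataFrame df_processed. If two columns match, I want to extract rows 1-7 of df_raw together with the matching column header from df_processed.

Example:

The values in c_1trial match only the values in rows 8-16 of column 07_08_19 #1. I want two steps: (1) find some way to determine that these two columns match each other, and (2) if the two columns match, select the rows of the matching column for the sample output.

Here is the output I am hoping to get: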

    Tr_id       07_08_19 #1       07_08_19 #2     07_08_19 #2.1       11_31_19 #1     11_31_19 #1.1     11_31_19 #1.3       12_15_20 #1       12_15_20 #2     12_15_20 #2.1     12_15_20 #2.2
Quantity1                 1                 1                 2                 1                 2                 3                 1                 1                 2                 3
Quantity2                75                11               0.9                70                60                17                 6                 3                 5                 7
Quantity3                 9                20                17                 4                74                12                34                43                92                11
Proc_Name          c_1trial              14_1              14_2               8_1               8_2               8_3              28_1              24_1              24_2              24_3
Quantity4                 7               -17               102                75                41               -89               496                12                17                62
Quantity5                -4                12                56               0.8               -36                30               -84               -23                64               -11
Quantity6               0.4               0.8               0.6               0.4               0.3               0.1               0.5               0.5               0.5               0.5
TimeStamp  07/08/2019 05:11  07/08/2019 10:54  07/08/2019 21:04  11/31/2019 11:15  11/31/2019 16:50  11/31/2019 21:33  12/15/2020 01:36  12/15/2020 07:01  12/15/2020 11:15  12/15/2020 21:45

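My attempts are giving me trouble:

print (df_raw.iloc[7:,1:] == df_processed).all(axis=1)

gives

ValueError: Can only compare identically-labeled DataFrame objects

print (df_raw.ix[7:].values == df_processed.values) #gives False

gives

False

The problem with my second attempt is that it leaves me no way to apply .all(axis=1). When I compare, I want the comparison to run over all rows of each column, not just one row.

Question:

Is there a way to select the output shown above from these two DataFrames?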
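
The two failures above can be reproduced on a toy pair of frames. This is only a sketch with made-up data (`raw` and `proc` are hypothetical names). `==` between DataFrames requires identical row and column labels, which explains the ValueError. In the second attempt, `df_raw.ix[7:]` still contains the `Tr_id` column, so the two arrays have 11 and 10 columns and the comparison cannot be element-wise. Stripping the labels from two equally shaped blocks gives an element-wise boolean array, which can then be reduced per column with `.all(axis=0)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the last two rows of `raw` hold the same numbers as `proc`.
raw = pd.DataFrame({"id": ["q", "t", np.nan, np.nan],
                    "x": [9, 8, 1, 2],
                    "y": [7, 6, 3, 4]})
proc = pd.DataFrame({"p": [1, 2], "q": [3, 4]})

# Labels differ (rows 2..3 vs 0..1, columns x/y vs p/q), so pandas refuses to compare.
try:
    raw.iloc[2:, 1:] == proc
    labels_ok = True
except ValueError:
    labels_ok = False

# Same shape, labels stripped: element-wise comparison, then reduce over rows per column.
same = raw.iloc[2:, 1:].to_numpy() == proc.to_numpy()
per_column = same.all(axis=0)
```

Here the columns happen to be in the same order, so a positional comparison is enough; with shuffled columns, as in the question, every raw column still has to be tested against every processed column.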

3 Answers:

Answer 0: (score: 1)

Does this look like the output you are looking for?

Raw DataFrame df

        Tr_id    07_08_19  07_08_19.1  07_08_19.2    11_31_19  11_31_19.1  
0   Quantity1           1           1           2           1           2   
1   Quantity2          75          11         0.9          70          60   
2   Quantity3           9          20          17           4          74   
3   Quantity4           7         -17         102          75          41   
4   Quantity5          -4          12          56         0.8         -36   
5   Quantity6         0.4         0.8         0.6         0.4         0.3   
6   TimeStamp  07/08/2019  07/08/2019  07/08/2019  11/31/2019  11/31/2019   
7         NaN           1           6          65           4           2   
8         NaN           2           5           3           3           3   
9         NaN           3           3           6           6          11   
10        NaN           4           7          78           8          55   
11        NaN           8           3           9           3           3   
12        NaN           4          23           2           5           7   
13        NaN           3          27          45          66          33   
14        NaN           8           3           6          32          65   
15        NaN           7          11           7          84          34   

    11_31_19.2    12_15_20  12_15_20.1  12_15_20.2  12_15_20.3  
0            3           1           1           2           3  
1           17           6           3           5           7  
2           12          34          43          92          11  
3          -89         496          12          17          62  
4           30         -84         -23          64         -11  
5          0.1         0.5         0.5         0.5         0.5  
6   11/31/2019  12/15/2020  12/15/2020  12/15/2020  12/15/2020  
7           22           1          34          51          12  
8            1           6          37          20          71  
9            6           3          46           1          34  
10          32           2         918          34          94  
11           5           6           0          12           1  
12           6          55          37          59          73  
13           4          22          91          78          46  
14           3           6          12           6          51  
15         898          23          68         101          21

Processed DataFrame dfp

   8_1  8_2  24_3  24_1  c_1trial  14_2  14_1  28_1  24_2  8_3
0    4    2    12    34         1    65     6     1    51   22
1    3    3    71    37         2     3     5     6    20    1
2    6   11    34    46         3     6     3     3     1    6
3    8   55    94   918         4    78     7     2    34   32
4    3    3     1     0         8     9     3     6    12    5
5    5    7    73    37         4     2    23    55    59    6
6   66   33    46    91         3    45    27    22    78    4
7   32   65    51    12         8     6     3     6     6    3
8   84   34    21    68         7     7    11    23   101  898

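Code:

import pandas as pd

df = pd.read_csv('raw_df.csv') # raw dataframe
dfp = pd.read_csv('processed_df.csv') # processed dataframe
dfr = df.drop('Tr_id', axis=1)

# Last 9 rows of the raw data as integers, re-indexed 0..8 so the labels line up with dfp
tail = dfr.tail(9).astype(int).reset_index(drop=True)

x = pd.DataFrame()
for col_raw in dfr.columns:
    for col_p in dfp.columns:
        # Two columns match when all 9 values are equal
        if (tail[col_raw] == dfp[col_p]).all():
            # Keep the 7 header rows and append the raw column name as an 8th row
            series = dfr[col_raw].head(7).tolist()
            series.append(col_raw)
            x[col_p] = series

x = pd.concat([df['Tr_id'].head(7), x], axis=1)

Output: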

       Tr_id    c_1trial        14_1        14_2         8_1         8_2  
0  Quantity1           1           1           2           1           2   
1  Quantity2          75          11         0.9          70          60   
2  Quantity3           9          20          17           4          74   
3  Quantity4           7         -17         102          75          41   
4  Quantity5          -4          12          56         0.8         -36   
5  Quantity6         0.4         0.8         0.6         0.4         0.3   
6  TimeStamp  07/08/2019  07/08/2019  07/08/2019  11/31/2019  11/31/2019   
7        NaN    07_08_19  07_08_19.1  07_08_19.2    11_31_19  11_31_19.1   

          8_3        28_1        24_1        24_2        24_3  
0           3           1           1           2           3  
1          17           6           3           5           7  
2          12          34          43          92          11  
3         -89         496          12          17          62  
4          30         -84         -23          64         -11  
5         0.1         0.5         0.5         0.5         0.5  
6  11/31/2019  12/15/2020  12/15/2020  12/15/2020  12/15/2020  
7  11_31_19.2    12_15_20  12_15_20.1  12_15_20.2  12_15_20.3 

I think the code could be more concise, but it may get the job done.
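
One possibly more concise route is NumPy broadcasting, which compares every raw column against every processed column in one step. The sketch below uses small made-up frames: `raw`, `proc`, `n_head` and the column names are assumptions for illustration, and it assumes each raw column matches exactly one processed column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the layout: 2 header rows, then 3 data rows.
raw = pd.DataFrame({"Tr_id": ["Quantity1", "TimeStamp", np.nan, np.nan, np.nan],
                    "A": [1, "t1", 4, 5, 6],
                    "B": [2, "t2", 7, 8, 9]})
proc = pd.DataFrame({"p_b": [7, 8, 9], "p_a": [4, 5, 6]})
n_head = 2  # number of header rows at the top of `raw`

# Data rows of the raw frame as an integer array, shape (rows, raw columns).
body = raw.drop(columns="Tr_id").iloc[n_head:].astype(int).to_numpy()

# match[i, j] is True when raw column i equals processed column j in every row.
match = (body[:, :, None] == proc.to_numpy()[:, None, :]).all(axis=0)

raw_cols = raw.columns[1:]
# For each raw column, the name of the processed column it matches.
mapping = {raw_cols[i]: proc.columns[j] for i, j in zip(*np.nonzero(match))}

# Header rows of the raw frame plus one extra row carrying the processed names.
header = raw.iloc[:n_head].copy()
header.loc[n_head] = ["Proc_Name"] + [mapping[c] for c in raw_cols]
```

The intermediate boolean array has one entry per (row, raw column, processed column) triple, so this trades memory for brevity; for frames of the size shown in the question that cost is negligible.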

Answer 1: (score: 1)

An alternative solution, using the DataFrame.isin() method:

In [171]: df1
Out[171]:
   a  b  c
0  1  1  3
1  0  2  4
2  4  2  2
3  0  3  3
4  0  4  4

In [172]: df2
Out[172]:
   a  b  c
0  0  3  3
1  1  1  1
2  0  3  4
3  4  2  3
4  0  4  4

In [173]: common = pd.merge(df1, df2)

In [174]: common
Out[174]:
   a  b  c
0  0  3  3
1  0  4  4

In [175]: df1[df1.isin(common.to_dict('list')).all(axis=1)]
Out[175]:
   a  b  c
3  0  3  3
4  0  4  4

Or, if you want to subtract the second dataset from the first, i.e. the Pandas equivalent of this SQL:

select col1, .., colN from tableA
minus
select col1, .., colN from tableB

In Pandas:

In [176]: df1[~df1.isin(common.to_dict('list')).all(axis=1)]
Out[176]:
   a  b  c
0  1  1  3
1  0  2  4
2  4  2  2
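
Note that `isin()` with a dict tests each column independently, so a row can pass when each of its values appears somewhere in `common` even though the row as a whole does not. A row-wise alternative, offered here as a sketch rather than as part of the answer above, is `merge` with `indicator=True`, which labels every row by where it was found. The frames reuse the `df1` and `df2` values shown above.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 0, 4, 0, 0], "b": [1, 2, 2, 3, 4], "c": [3, 4, 2, 3, 4]})
df2 = pd.DataFrame({"a": [0, 1, 0, 4, 0], "b": [3, 1, 3, 2, 4], "c": [3, 1, 4, 3, 4]})

# Left merge on all shared columns; the `_merge` column says where each row was found.
# Caveat: duplicate rows in df2 would repeat the matching df1 rows.
tagged = df1.merge(df2, how="left", indicator=True)

# Rows of df1 that also occur in df2.
in_both = tagged.loc[tagged["_merge"] == "both", ["a", "b", "c"]]
# Rows of df1 that do not occur in df2 (df1 minus df2).
only_df1 = tagged.loc[tagged["_merge"] == "left_only", ["a", "b", "c"]]
```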

Answer 2: (score: 0)

I came up with this using loops. It is rather disappointing:

holder = []
for randm,pp in enumerate(list(df_processed)):
    list1 = df_processed[pp].tolist()
    for car,rr in enumerate(list(df_raw)):
        list2 = df_raw.loc[7:,rr].tolist()
        if list1==list2:
            holder.append([rr,pp])

df_intermediate = pd.DataFrame(holder,columns=['A','B'])
df_c = df_raw.loc[:6,df_intermediate.iloc[:,0].tolist()]
df_c.loc[df_c.shape[0]] = df_intermediate.iloc[:,1].tolist()
df_c.insert(0,list(df_raw)[0],df_raw[list(df_raw)[0]])
df_c.iloc[-1,0]='Proc_Name'
df_c = df_c.reindex([0,1,2]+[7]+[3,4,5,6]).reset_index(drop=True)

Output:

       Tr_id       11_31_19 #1     11_31_19 #1.1     12_15_20 #2.2       12_15_20 #2       07_08_19 #1     07_08_19 #2.1       07_08_19 #2       12_15_20 #1     12_15_20 #2.1     11_31_19 #1.3
0  Quantity1                 1                 2                 3                 1                 1                 2                 1                 1                 2                 3
1  Quantity2                70                60                 7                 3                75               0.9                11                 6                 5                17
2  Quantity3                 4                74                11                43                 9                17                20                34                92                12
3  Proc_Name               8_1               8_2              24_3              24_1          c_1trial              14_2              14_1              28_1              24_2               8_3
4  Quantity4                75                41                62                12                 7               102               -17               496                17               -89
5  Quantity5               0.8               -36               -11               -23                -4                56                12               -84                64                30
6  Quantity6               0.4               0.3               0.5               0.5               0.4               0.6               0.8               0.5               0.5               0.1
7  TimeStamp  11/31/2019 11:15  11/31/2019 16:50  12/15/2020 21:45  12/15/2020 07:01  07/08/2019 05:11  07/08/2019 21:04  07/08/2019 10:54  12/15/2020 01:36  12/15/2020 11:15  11/31/2019 21:33

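The column order is not the one I asked for, but that is a minor issue.

The real problem with this approach is the use of loops. I was hoping there is a better way using some built-in Pandas functionality. If you have a better solution, please post it. Thanks.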
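
One way to replace the nested loop with a single pass, sketched under the assumption that no two processed columns hold identical values, is to key a dictionary on each processed column's values and look every raw column up in it. The frames below are a hypothetical miniature of `df_raw` and `df_processed`, and `n_head` is an illustrative name for the number of header rows.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the layout: 2 header rows, then 3 data rows.
raw = pd.DataFrame({"Tr_id": ["Quantity1", "TimeStamp", np.nan, np.nan, np.nan],
                    "A": [1, "t1", 4, 5, 6],
                    "B": [2, "t2", 7, 8, 9]})
proc = pd.DataFrame({"p_b": [7, 8, 9], "p_a": [4, 5, 6]})
n_head = 2

# Values of each processed column (as a tuple) -> that column's name.
lookup = {tuple(proc[c]): c for c in proc.columns}

# Data rows of the raw frame, without the Tr_id column.
data = raw.iloc[n_head:, 1:]
# Name of the matching processed column for every raw column (NaN if none matches).
proc_names = data.apply(lambda col: lookup.get(tuple(col), np.nan))

# Header rows plus a Proc_Name row, in the raw frame's original column order.
out = raw.iloc[:n_head].copy()
out.loc[n_head] = ["Proc_Name"] + proc_names.tolist()
```

Because the result is built from the raw frame's own columns, the column order of the raw data is preserved; moving the `Proc_Name` row to another position can still be done with `reindex`, as in the code above.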