Pandas: merge two rows within a group by a complex condition

Time: 2020-09-01 03:40:41

Tags: python pandas dataframe loops group-by

My df is as follows:

import pandas as pd

df = pd.DataFrame({
    "ID": ['company A', 'company A', 'company A', 'company B','company B', 'company B', 'company C', 'company C','company C','company C', 'company D', 'company D','company D'],
    'Sender': [28, 'delete', 'flag_source', 56, 28, 312, 'delete', 'flag_source', 78, 102, 26, 101, 96],
    'Receiver': [129, 28, 'delete', 172, 56, 28, 61, 'delete', 12, 78, 98, 26, 101],
    'Date': ['2020-04-12', '2020-03-20', '2020-03-20', '2019-02-11', '2019-01-31', '2018-04-02', '2020-06-29', '2020-06-29', '2019-11-29', '2019-10-01', '2020-04-03', '2020-01-30', '2019-10-18'],
    'Sender_type': ['house', 'temp', 'house', 'house', 'house', 'house', 'temp', 'house', 'house','house','house', 'temp', 'house'],
    'Receiver_type': ['house', 'house', 'temp', 'house','house','house','house', 'temp', 'house','house','house','house','temp'],
    'Price': [32, 50, 47, 21, 23, 19, 52, 39, 12, 22, 61, 53, 19]
})

It looks like this:

           ID       Sender Receiver        Date Sender_type Receiver_type  Price  
0   company A           28      129  2020-04-12       house         house  32 
1   company A       delete       28  2020-03-20        temp         house  50 # combine this row with below
2   company A  flag_source   delete  2020-03-20       house          temp  47 # combine this row with above
3   company B           56      172  2019-02-11       house         house  21 
4   company B           28       56  2019-01-31       house         house  23 
5   company B          312       28  2018-04-02       house         house  19 
6   company C       delete       61  2020-06-29        temp         house  52 # combine this row and below
7   company C  flag_source   delete  2020-06-29       house          temp  39 # combine this row with above
8   company C           78       12  2019-11-29       house         house  12 
9   company C          102       78  2019-10-01       house         house  22 
10  company D           26       98  2020-04-03       house         house  61 
11  company D          101       26  2020-01-30        temp         house  53 
12  company D           96      101  2019-10-18       house          temp  19 

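For each "ID" group (company x), I want to merge two rows by the following rule: the row whose 'Sender' is 'flag_source' is merged with the row directly above it into one new row. In this new row, 'Sender' is 'flag_source', 'Receiver' is the value from the row above (the two 'delete' values are dropped), 'Date' is the date from the row above, 'Sender_type' and 'Receiver_type' are both 'house', and 'Price' is the value from the row above. The two original rows are then removed. For example, for company A, rows 1 and 2 are merged to produce the new row below: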

ID        Sender        Receiver  Date        Sender_type  Receiver_type  Price
company A flag_source   28        2020-03-20  house        house          50

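This new row then replaces the two original rows. The same rule applies to the other groups (in this example only companies A and C are affected). In the end, I expect a result like this: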

           ID       Sender  Receiver        Date Sender_type Receiver_type  Price
0   company A           28       129  2020-04-12       house         house   32
1   company A  flag_source        28  2020-03-20       house         house   50 # new row
2   company B           56       172  2019-02-11       house         house   21
3   company B           28        56  2019-01-31       house         house   23
4   company B          312        28  2018-04-02       house         house   19
5   company C  flag_source        61  2020-06-29       house         house   52 # new row
6   company C           78        12  2019-11-29       house         house   12
7   company C          102        78  2019-10-01       house         house   22
8   company D           26        98  2020-04-03       house         house   61
9   company D          101        26  2020-01-30        temp         house   53
10  company D           96       101  2019-10-18       house          temp   19

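I hope I have explained the problem clearly.

This is only a short example; the real data contains many such cases. I wrote a loop, but it is very slow and inefficient, so if you have any ideas for an efficient approach, please help. Thank you very much!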

3 answers:

Answer 0: (score: 1)

import pandas as pd

df = pd.DataFrame({
    "ID": ['company A', 'company A', 'company A', 'company B','company B', 'company B', 'company C', 'company C','company C','company C', 'company D', 'company D','company D'],
    'Sender': [28, 'delete', 'flag_source', 56, 28, 312, 'delete', 'flag_source', 78, 102, 26, 101, 96],
    'Receiver': [129, 28, 'delete', 172, 56, 28, 61, 'delete', 12, 78, 98, 26, 101],
    'Date': ['2020-04-12', '2020-03-20', '2020-03-20', '2019-02-11', '2019-01-31', '2018-04-02', '2020-06-29', '2020-06-29', '2019-11-29', '2019-10-01', '2020-04-03', '2020-01-30', '2019-10-18'],
    'Sender_type': ['house', 'temp', 'house', 'house', 'house', 'house', 'temp', 'house', 'house','house','house', 'temp', 'house'],
    'Receiver_type': ['house', 'house', 'temp', 'house','house','house','house', 'temp', 'house','house','house','house','temp'],
    'Price': [32, 50, 47, 21, 23, 19, 52, 39, 12, 22, 61, 53, 19]
})

# Rows whose Sender is 'flag_source'; each one is assumed to sit directly below its 'delete' row
flaggedData = df[df["Sender"] == "flag_source"]

for i, row in flaggedData.iterrows():  # row is the flag_source row

    deleteRow = df.loc[i - 1]  # the 'delete' row directly above (relies on the default integer index)

    # Build the merged row by column name instead of by position
    df.loc[i - 1] = [row["ID"],
                     row["Sender"],
                     deleteRow["Receiver"],
                     deleteRow["Date"],
                     row["Sender_type"],
                     deleteRow["Receiver_type"],
                     deleteRow["Price"]]

    df = df.drop(index=i)  # drop the flag_source row, which is now merged into the row above

df = df.reset_index(drop=True)  # renumber the rows without keeping the old index as a column
print(df.loc[1])

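I assume that every 'delete' row sits directly above its 'flag_source' row. If anything is still unclear, please read the comments in the code, and feel free to post any questions.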
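The same rule can also be written without a Python-level loop. The following is only a minimal sketch, not the answerer's code: `merge_flagged_rows` is a hypothetical helper name, and it relies on the same assumption as above, namely that each 'delete' row sits directly above its 'flag_source' row within one ID.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ['company A', 'company A', 'company A', 'company B', 'company B', 'company B', 'company C', 'company C', 'company C', 'company C', 'company D', 'company D', 'company D'],
    'Sender': [28, 'delete', 'flag_source', 56, 28, 312, 'delete', 'flag_source', 78, 102, 26, 101, 96],
    'Receiver': [129, 28, 'delete', 172, 56, 28, 61, 'delete', 12, 78, 98, 26, 101],
    'Date': ['2020-04-12', '2020-03-20', '2020-03-20', '2019-02-11', '2019-01-31', '2018-04-02', '2020-06-29', '2020-06-29', '2019-11-29', '2019-10-01', '2020-04-03', '2020-01-30', '2019-10-18'],
    'Sender_type': ['house', 'temp', 'house', 'house', 'house', 'house', 'temp', 'house', 'house', 'house', 'house', 'temp', 'house'],
    'Receiver_type': ['house', 'house', 'temp', 'house', 'house', 'house', 'house', 'temp', 'house', 'house', 'house', 'house', 'temp'],
    'Price': [32, 50, 47, 21, 23, 19, 52, 39, 12, 22, 61, 53, 19]
})


def merge_flagged_rows(frame):
    """Merge each flag_source row into the row directly above it (hypothetical helper)."""
    is_flag = frame["Sender"].eq("flag_source")

    # Mark the row directly above each flag_source row, staying inside the same ID.
    # Shifting integers with fill_value=0 avoids NaN and dtype surprises.
    is_above = is_flag.astype(int).groupby(frame["ID"]).shift(-1, fill_value=0).eq(1)

    # Values of the next row within each ID, aligned to the current row.
    nxt = frame.groupby("ID")[["Sender", "Sender_type"]].shift(-1)

    out = frame.copy()
    # The row above keeps its own Receiver, Date, Receiver_type and Price,
    # and takes Sender and Sender_type from the flag_source row below it.
    out.loc[is_above, "Sender"] = nxt.loc[is_above, "Sender"]
    out.loc[is_above, "Sender_type"] = nxt.loc[is_above, "Sender_type"]

    # Drop the flag_source rows and renumber.
    return out[~is_flag].reset_index(drop=True)


result = merge_flagged_rows(df)
print(result)
```

Grouping by ID before shifting keeps a 'flag_source' row at the top of one company from being paired with the last row of the previous company.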

Answer 1: (score: 0)

It seems you only need to delete the second row of each pair and replace some values in the remaining row.

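# Drop the second row of each pair (the flag_source row, whose Receiver is 'delete')
df = df[df.Receiver != 'delete'].copy()
# Only on the remaining 'delete' rows, set the merged Sender and Sender_type
is_delete = df.Sender == 'delete'
df.loc[is_delete, ['Sender', 'Sender_type']] = ['flag_source', 'house']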
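One thing to watch in this approach: the Sender column mixes integers and strings, and the `.str` accessor returns NaN for every non-string element, whereas `Series.replace` leaves the other values untouched. A minimal sketch of the difference on a hypothetical four-value Series (not the full data):

```python
import pandas as pd

# Hypothetical slice of a mixed int/str column such as Sender
sender = pd.Series([28, "delete", "flag_source", 56], dtype=object)

# .str methods only operate on strings; the integers become NaN
via_str = sender.str.replace("delete", "flag_source")

# Series.replace swaps whole values and keeps the integers as they are
via_replace = sender.replace("delete", "flag_source")

print(via_str)
print(via_replace)
```

For whole-value substitutions on a mixed column, `Series.replace` or a boolean mask with `.loc` is therefore the safer choice.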

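Answer 2: (score: 0)

If the delete / flag_source rows always fall on the same date, and there are no other rows for that date + ID, you can use a groupby aggregation on ID and Date to avoid a long loop. If your data is not in the right order, you can always sort_values it beforehand.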

cols = df.columns

new_df = df.groupby(['ID', 'Date']).aggregate({
    'Sender': 'last', 
    'Receiver': 'first', 
    'Sender_type': 'last', 
    'Receiver_type': 'first', 
    'Price': 'first'
    }).reset_index()

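# Restore the original column order and row order (ID ascending, Date descending);
# assign the result, otherwise the sorted frame is discarded
new_df = new_df[cols].sort_values(['ID', 'Date'], ascending=[True, False]).reset_index(drop=True)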
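The groupby approach silently merges any rows that share an ID and Date, so it may be worth checking its assumption first. Below is a minimal sketch of such a check; `unsafe_groups` is a hypothetical helper, and `sample` is a trimmed copy of the question's data that keeps only the three columns the check needs.

```python
import pandas as pd

# Trimmed copy of the question's data: only ID, Sender and Date are needed here
sample = pd.DataFrame({
    "ID": ['company A', 'company A', 'company A', 'company B', 'company B', 'company B', 'company C', 'company C', 'company C', 'company C', 'company D', 'company D', 'company D'],
    "Sender": [28, 'delete', 'flag_source', 56, 28, 312, 'delete', 'flag_source', 78, 102, 26, 101, 96],
    "Date": ['2020-04-12', '2020-03-20', '2020-03-20', '2019-02-11', '2019-01-31', '2018-04-02', '2020-06-29', '2020-06-29', '2019-11-29', '2019-10-01', '2020-04-03', '2020-01-30', '2019-10-18'],
})


def unsafe_groups(frame):
    """Return (ID, Date) groups with several rows that are not one delete/flag_source pair."""
    flags = frame.assign(
        is_flag=frame["Sender"].eq("flag_source"),
        is_delete=frame["Sender"].eq("delete"),
    )
    stats = flags.groupby(["ID", "Date"]).agg(
        rows=("Sender", "size"),
        n_flag=("is_flag", "sum"),
        n_delete=("is_delete", "sum"),
    )
    # A safe multi-row group is exactly one 'delete' row plus one 'flag_source' row
    is_pair = (stats["rows"] == 2) & (stats["n_flag"] == 1) & (stats["n_delete"] == 1)
    return stats[(stats["rows"] > 1) & ~is_pair]


problems = unsafe_groups(sample)
print(problems)
```

On the question's data the returned frame is empty, because the only ID + Date groups with more than one row are the two delete / flag_source pairs. Any group listed here would be collapsed into a single row by the aggregation, losing data.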