How do I apply a condition on pandas DataFrame columns?

Time: 2020-03-17 16:12:27

Tags: pandas dataframe if-statement

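I have a pandas DataFrame that contains information about rejections. Some background on the problem: an email sender can send the same email multiple times, but it can only be resolved once. I would still like to flag, in a new column, the emails that share a sender and message with a "Resolved" email.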

The starting DataFrame looks like this:

import pandas as pd

data = [['Sent from automated email', 'jim@yahoo.com', 'Resolved','2020-01-13 07:06:34'], 
        ['Sent from automated email', 'jim@yahoo.com', 'Rejected','2020-01-13 07:06:39'], 
        ['Hello I would like for you to make an update please','new101@cnn.com', 'Resolved', '2020-02-14 09:06:39'], 
        ['Hello I would like for you to make an update please','new101@cnn.com', 'Rejected', '2020-02-14 09:06:41'],
        ['Hello I would like for you to make an update please','new101@cnn.com', 'Resolved', '2020-02-14 09:06:59'],
        ['Take one newspaper','notneeded@gmail.com', 'Resolved', '2020-02-17 09:05:39'],
        ['Hey hows it going','jamie@gmail.com', 'Rejected', '2020-03-12 09:03:42'],
        ] 

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Message', 'Email','Resolution','Time Sent']) 

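I want to take all emails that have the same sender and the same message but different resolutions, and mark them as "Resolved" if any of the earlier emails was resolved. My desired output is: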

data = [['Sent from automated email', 'jim@yahoo.com', 'Resolved','2020-01-13 07:06:34','Resolved' ], 
        ['Sent from automated email', 'jim@yahoo.com', 'Rejected','2020-01-13 07:06:39','Resolved'], 
        ['Hello I would like for you to make an update please','new101@cnn.com', 'Resolved', '2020-02-14 09:06:39','Resolved'], 
        ['Hello I would like for you to make an update please','new101@cnn.com', 'Rejected', '2020-02-14 09:06:41','Resolved'],
        ['Hello I would like for you to make an update please','new101@cnn.com', 'Resolved', '2020-02-14 09:06:59','Resolved'],
        ] 

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Message', 'Email','Resolution','Time Sent','Real Resolution']) 

I tried writing a function like this:

    def a(df):
        if df[df['Message'].duplicated()] & df[(df['Resolution'] == 'Rejected') | (df['Resolution'] == 'Resolved') ] & df[df['Email'].duplicated()]:
           df['Real Resolution'] = 'Resolved' 

df['Real Resolution'] = df.apply(a)

I don't think this is right, because I am not only looking at duplicate emails that were resolved and then rejected. Any tips? Thanks!

1 answer:

Answer 0: (score: 1)

IIUC, you can try the following:

c = df[['Message','Email']].duplicated(keep=False) #check duplicate in Message+Email
c1 = df[['Message','Email','Resolution']].duplicated(keep=False) #check resolution too
#condition is if c is True and c1 is False then check if email group has any True
df.loc[(c & ~c1).groupby(df['Email']).transform('any'),'Real Resolution'] = 'Resolved'

out = df.dropna(subset=['Real Resolution']).copy()
print(out)

                                             Message           Email  \
0                          Sent from automated email   jim@yahoo.com   
1                          Sent from automated email   jim@yahoo.com   
2  Hello I would like for you to make an update p...  new101@cnn.com   
3  Hello I would like for you to make an update p...  new101@cnn.com   
4  Hello I would like for you to make an update p...  new101@cnn.com   

  Resolution            Time Sent Real Resolution  
0   Resolved  2020-01-13 07:06:34        Resolved  
1   Rejected  2020-01-13 07:06:39        Resolved  
2   Resolved  2020-02-14 09:06:39        Resolved  
3   Rejected  2020-02-14 09:06:41        Resolved  
4   Resolved  2020-02-14 09:06:59        Resolved
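
As a possible variation on the approach above, here is a minimal sketch that groups on both `Message` and `Email` rather than on `Email` alone, and checks explicitly that a group contains at least one `'Resolved'` row. It assumes the intent is "same sender and same message, with mixed resolutions, at least one of them resolved". The names `keys`, `has_resolved` and `mixed` are illustrative and do not come from the question or the answer.

```python
import pandas as pd

data = [
    ['Sent from automated email', 'jim@yahoo.com', 'Resolved', '2020-01-13 07:06:34'],
    ['Sent from automated email', 'jim@yahoo.com', 'Rejected', '2020-01-13 07:06:39'],
    ['Hello I would like for you to make an update please', 'new101@cnn.com', 'Resolved', '2020-02-14 09:06:39'],
    ['Hello I would like for you to make an update please', 'new101@cnn.com', 'Rejected', '2020-02-14 09:06:41'],
    ['Hello I would like for you to make an update please', 'new101@cnn.com', 'Resolved', '2020-02-14 09:06:59'],
    ['Take one newspaper', 'notneeded@gmail.com', 'Resolved', '2020-02-17 09:05:39'],
    ['Hey hows it going', 'jamie@gmail.com', 'Rejected', '2020-03-12 09:03:42'],
]
df = pd.DataFrame(data, columns=['Message', 'Email', 'Resolution', 'Time Sent'])

# Assumption: a "group" is one sender sending one message (both columns together).
keys = [df['Message'], df['Email']]

# True for every row whose group contains at least one 'Resolved' email.
has_resolved = df['Resolution'].eq('Resolved').groupby(keys).transform('any')

# True for every row whose group contains more than one distinct resolution.
mixed = df['Resolution'].groupby(keys).transform('nunique').gt(1)

# Rows outside the mask are left as NaN in the new column.
df.loc[has_resolved & mixed, 'Real Resolution'] = 'Resolved'

# Keep only the flagged rows, as in the desired output.
out = df.dropna(subset=['Real Resolution']).copy()
print(out)
```

On the sample data this keeps the same five rows (index 0 to 4) as the output shown above. The difference would only show up if one sender had sent several different messages, since grouping by `Email` alone would treat them as a single group.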