Slicing rows multiple times in pandas based on a cell condition

Time: 2016-10-31 17:48:36

Tags: python python-3.x pandas dataframe sklearn-pandas

Msgtype Date ConvID   message
enquire 12/1 689  I want your car
reply   12/3 689  it is available
reply   12/4 689  rent please?
reply   12/6 689  $200
accept  12/8 689  please pay through CC
reply   12/8 689  thank you, what about fuel?
reply   12/8 689  you have to take care
enquire 12/3 690  Looking for car
reply   12/4 690  available
accept  12/5 690  paid
reply   12/6 690  thank you

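I want to group this data by ConvID and sort it by date. For each ConvID, I want to keep the rows up to the one where Msgtype is 'accept'. The goal is to analyze the message data up to the point where the booking request for that ConvID is accepted. So for ConvID = 689, I want the rows up to Msgtype == 'accept'; the rows after 'accept' are not needed.

For example, for ConvID = 689:

These two rows are not needed: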
    Msgtype Date ConvID   message
    reply   12/8 689  thank you, what about fuel?
    reply   12/8 689  you have to take care

Similarly, for ConvID = 690:

This row is not needed:
    Msgtype Date ConvID   message
    reply   12/6 690  thank you
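
For anyone who wants to reproduce the problem, here is a minimal sketch that loads the sample table above into a DataFrame. The variable name `df` is an assumption chosen to match what the answers below use; the values are copied from the table as-is, with `Date` kept as plain strings.

```python
import pandas as pd

# Sample rows copied from the table in the question.
rows = [
    ("enquire", "12/1", 689, "I want your car"),
    ("reply", "12/3", 689, "it is available"),
    ("reply", "12/4", 689, "rent please?"),
    ("reply", "12/6", 689, "$200"),
    ("accept", "12/8", 689, "please pay through CC"),
    ("reply", "12/8", 689, "thank you, what about fuel?"),
    ("reply", "12/8", 689, "you have to take care"),
    ("enquire", "12/3", 690, "Looking for car"),
    ("reply", "12/4", 690, "available"),
    ("accept", "12/5", 690, "paid"),
    ("reply", "12/6", 690, "thank you"),
]

# `df` is an assumed name; Date stays a string column here.
df = pd.DataFrame(rows, columns=["Msgtype", "Date", "ConvID", "message"])
print(df)
```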

2 Answers:

Answer 0 (score: 1)

I think you can use:

mask1 = (df.Msgtype == 'accept')
mask = mask1.groupby([df.ConvID], group_keys=False).apply(lambda x: x.shift().fillna(False).cumsum()) == 0

print (df[mask].sort_values(['ConvID','Date']))
   Msgtype  Date  ConvID                message
0  enquire  12/1     689        I want your car
1    reply  12/3     689        it is available
2    reply  12/4     689           rent please?
3    reply  12/6     689                   $200
4   accept  12/8     689  please pay through CC
7  enquire  12/3     690        Looking for car
8    reply  12/4     690              available
9   accept  12/5     690                   paid

Explanation:

# mask of the rows where Msgtype is 'accept'
mask1 = (df.Msgtype == 'accept')
print (mask1)
0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8     False
9      True
10    False
Name: Msgtype, dtype: bool

# per group: shift by one row, replace NaN by False and take the cumulative sum
print (mask1.groupby([df.ConvID], group_keys=False).apply(lambda x: x.shift().fillna(False).cumsum()))
0     0
1     0
2     0
3     0
4     0
5     1
6     1
7     0
8     0
9     0
10    1
Name: Msgtype, dtype: int32
# True where the output of the groupby is 0 (no 'accept' seen before this row)
mask = mask1.groupby([df.ConvID], group_keys=False).apply(lambda x: x.shift().fillna(False).cumsum()) == 0
print (mask)
0      True
1      True
2      True
3      True
4      True
5     False
6     False
7      True
8      True
9      True
10    False
Name: Msgtype, dtype: bool

#boolean indexing and sorting
print (df[mask].sort_values(['ConvID','Date']))
   Msgtype  Date  ConvID                message
0  enquire  12/1     689        I want your car
1    reply  12/3     689        it is available
2    reply  12/4     689           rent please?
3    reply  12/6     689                   $200
4   accept  12/8     689  please pay through CC
7  enquire  12/3     690        Looking for car
8    reply  12/4     690              available
9   accept  12/5     690                   paid
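
The same mask can also be sketched without `apply`. Counting the 'accept' rows cumulatively per ConvID and then subtracting each row's own flag gives the number of accepts strictly before that row, which is what the shift-and-cumsum step computes. This assumes the rows are already in chronological order within each conversation, as in the sample; the names `is_accept`, `accepts_so_far` and `accepts_before` are illustrative only, and the message column is left out for brevity.

```python
import pandas as pd

# Sample data from the question (message column omitted for brevity).
df = pd.DataFrame({
    "Msgtype": ["enquire", "reply", "reply", "reply", "accept", "reply", "reply",
                "enquire", "reply", "accept", "reply"],
    "Date": ["12/1", "12/3", "12/4", "12/6", "12/8", "12/8", "12/8",
             "12/3", "12/4", "12/5", "12/6"],
    "ConvID": [689] * 7 + [690] * 4,
})

# 1 where the row is an 'accept', 0 otherwise.
is_accept = (df["Msgtype"] == "accept").astype(int)

# Running count of accepts within each conversation, current row included.
accepts_so_far = is_accept.groupby(df["ConvID"]).cumsum()

# Remove the current row's own flag: accepts strictly before this row.
accepts_before = accepts_so_far - is_accept

# Keep rows with no earlier accept, i.e. everything up to and including it.
result = df[accepts_before == 0]
print(result)
```

Because the result of `cumsum` keeps the original index, the mask lines up with `df` directly and no index handling is needed.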

Answer 1 (score: 0)

Easy:

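for name, grp in df.groupby('ConvID'):
    # sort a copy of the group instead of sorting it in place
    grp = grp.sort_values('Date')
    # date of the first 'accept' row, taken as a scalar rather than a Series
    accept_date = grp.loc[grp['Msgtype'] == 'accept', 'Date'].iloc[0]
    req = grp[grp['Date'] < accept_date]
    # Or, you can use the index, like so:
    # grp = grp.sort_values('Date').reset_index(drop=True)
    # req = grp.iloc[:grp[grp['Msgtype'] == 'accept'].index.values[0], :]

Inside the loop, req will contain only the required rows that you can use for your analysis. Both variants stop just before the 'accept' row; to keep that row as well, add 1 to the end of the iloc slice in the index-based variant.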
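
Here is a position-based variant of the loop above, as a rough sketch rather than a definitive implementation. It assumes that the 'accept' row itself should be kept (as in the output of the first answer), that `Date` holds month/day strings, and that a conversation with no 'accept' row is kept whole. `rows_until_accept` is a hypothetical helper name, and the message column is left out for brevity.

```python
import pandas as pd

def rows_until_accept(df):
    """Keep, per ConvID, the rows up to and including the first 'accept'."""
    parts = []
    for _, grp in df.groupby("ConvID"):
        # Parse the month/day strings so that e.g. 12/10 sorts after 12/3.
        # mergesort is stable, so same-day messages keep their original order.
        order = pd.to_datetime(grp["Date"], format="%m/%d")
        grp = grp.loc[order.sort_values(kind="mergesort").index]
        # Positions (not labels) of the 'accept' rows in the sorted group.
        hits = (grp["Msgtype"] == "accept").to_numpy().nonzero()[0]
        # Assumption: a conversation without an 'accept' is kept in full.
        end = hits[0] + 1 if len(hits) else len(grp)
        parts.append(grp.iloc[:end])
    return pd.concat(parts)

# Sample data from the question (message column omitted for brevity).
df = pd.DataFrame({
    "Msgtype": ["enquire", "reply", "reply", "reply", "accept", "reply", "reply",
                "enquire", "reply", "accept", "reply"],
    "Date": ["12/1", "12/3", "12/4", "12/6", "12/8", "12/8", "12/8",
             "12/3", "12/4", "12/5", "12/6"],
    "ConvID": [689] * 7 + [690] * 4,
})

result = rows_until_accept(df)
print(result)
```

Slicing by position avoids comparing dates at all, which matters in the sample because two replies share the date of the 'accept' row.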