Given the following dataframe:
import pandas as pd

data = [['2019-06-20 12:28:00', '05123', 2, 8888],
        ['2019-06-20 13:28:00', '55874', 6, 8888],
        ['2019-06-20 13:35:00', '12345', 1, 8888],
        ['2019-06-20 13:35:00', '35478', 2, 1234],
        ['2019-06-20 13:35:00', '12345', 2, 8888],
        ['2019-06-20 14:22:00', '98765', 1, 8888]]
columns = ['pdate', 'station', 'ptype', 'train']
df = pd.DataFrame(data, columns=columns)
where 'pdate' = the time of travel, 'station' = the station code, 'ptype' = the type of record, and 'train' = the train number.
'ptype' can take the following values (1 = arrival, 2 = departure, 6 = pass-through).
This is the result:
pdate station ptype train
0 2019-06-20 12:28:00 05123 2 8888
1 2019-06-20 13:28:00 55874 6 8888
2 2019-06-20 13:35:00 12345 1 8888
3 2019-06-20 13:35:00 35478 2 1234
4 2019-06-20 13:35:00 12345 2 8888
5 2019-06-20 14:22:00 98765 1 8888
Unfortunately, sometimes a station mistakenly records 'ptype' = 1 (arrival) and 'ptype' = 2 (departure) at exactly the same time instead of registering 'ptype' = 6 (pass-through), so I have to treat those 2 records as a single pass-through record.
I have to drop from the dataframe every row that has ptype = 6 OR (ptype = 1 where the next record for the same station and the same train number has ptype = 2 at exactly the same time).
So, from the given example, I have to drop rows 1, 2 and 4.
I can drop all the rows with ptype = 6 without any problem:
df = df.drop(df[(df['ptype']==6)].index)
But I don't know how to drop the other pairs. Any ideas?
Answer 0 (score: 2)
IIUC, you can do a groupby and nunique:
# convert to datetime. Skip if already is.
df.pdate = pd.to_datetime(df.pdate)
# drop all the 6 records:
df = df[df.ptype.ne(6)]
# keep only the (pdate, train) groups that still have a single ptype value
(df[df.groupby(['pdate', 'train'])
      .ptype.transform('nunique').eq(1)]
)
Output:
pdate station ptype train
0 2019-06-20 12:28:00 05123 2 8888
3 2019-06-20 13:35:00 35478 2 1234
5 2019-06-20 14:22:00 98765 1 8888
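The mask works because, once the ptype = 6 rows are removed, a spurious arrival/departure pair leaves two distinct ptype values in its (pdate, train) group, so nunique is 2 and both rows are filtered out, while genuine single events keep nunique = 1. If the same train could ever have simultaneous records at two different stations, adding station to the group keys would follow the stated rule ("same station and same train number") more literally; a minimal variant of the line above, offered as a suggestion rather than as part of the original answer:
# same idea, but also grouping by station so only same-station pairs are treated as a pass-through
df[df.groupby(['pdate', 'station', 'train']).ptype.transform('nunique').eq(1)]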
Answer 1 (score: 0)
Here is an approach for you:
# We look at the problematic ptypes.
# We group by station, train and pdate to identify the problematic rows.
test = df[(df['ptype'] == 1) | (df['ptype'] == 2)].groupby(['station', 'train', 'pdate']).size().reset_index()
# If there is more than one row for a key, that means there is a duplicate.
errors = test[test[0] > 1][['station', 'train', 'pdate']]
# We create a column to_remove to later identify the problematic rows.
errors['to_remove'] = 1
df = df.merge(errors, on=['station', 'train', 'pdate'], how='left')
# We drop the problematic rows.
df = df.drop(index=df[df['to_remove'] == 1].index)
# We drop the column to_remove, which is no longer necessary.
df.drop(columns='to_remove', inplace=True)
Output:
pdate station ptype train
0 2019-06-20 12:28:00 05123 2 8888
1 2019-06-20 13:28:00 55874 6 8888
3 2019-06-20 13:35:00 35478 2 1234
5 2019-06-20 14:22:00 98765 1 8888
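As a design note, the to_remove helper column only serves to mark the rows that matched a problematic key. One possible alternative, assuming the errors frame built above, is an anti-join via merge(..., indicator=True), which flags unmatched rows directly:
# left merge with indicator=True adds a '_merge' column: 'both' for problematic keys, 'left_only' otherwise
merged = df.merge(errors[['station', 'train', 'pdate']],
                  on=['station', 'train', 'pdate'], how='left', indicator=True)
df = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')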
Answer 2 (score: 0)
This isn't a pandas-style approach, but if I've understood what you're after correctly, it does produce the desired result.
# a dict for unique filtered records
filtered_records = {}

def unique_key(row):
    return '%s-%s-%d' % (row[columns[0]], row[columns[1]], row[columns[3]])

# populate a map of unique (pdate, station, train) records
for index, row in df.iterrows():
    key = unique_key(row)
    val = filtered_records.get(key, None)
    if val is None:
        filtered_records[key] = row[columns[2]]
    else:
        # if there's a 1 and a 2 record, declare the record a 6
        if val * row[columns[2]] == 2:
            filtered_records[key] = 6

# helper function for apply
def update_row_ptype(row):
    val = filtered_records[unique_key(row)]
    return val if val == 6 else row[columns[2]]

# update the dataframe with the invalid entries detected in the dict
df[columns[2]] = df.apply(lambda row: update_row_ptype(row), axis=1)

# drop 'em
df.drop(df[df[columns[2]] == 6].index, inplace=True)

print(df)
Output:
pdate station ptype train
0 2019-06-20 12:28:00 05123 2 8888
3 2019-06-20 13:35:00 35478 2 1234
5 2019-06-20 14:22:00 98765 1 8888
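For the sample data, tracing the loop shows why the trick works: the two 13:35 records for station 12345 on train 8888 share one key, their ptype values multiply to 2 (1 × 2), and the key is therefore flagged as a 6. The dict then looks roughly like this (my own trace of the code above, not output from the original answer):
filtered_records = {
    '2019-06-20 12:28:00-05123-8888': 2,
    '2019-06-20 13:28:00-55874-8888': 6,  # genuine pass-through
    '2019-06-20 13:35:00-12345-8888': 6,  # flagged arrival/departure pair
    '2019-06-20 13:35:00-35478-1234': 2,
    '2019-06-20 14:22:00-98765-8888': 1,
}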