Let's say I have the following data on users' hotel stays:
                   end                start  uid
0  2014-01-02 00:00:00  2014-01-01 00:00:00    1
1  2014-01-04 00:00:00  2014-01-02 00:00:00    1
2  2014-02-02 00:00:00  2014-02-01 00:00:00    1
3  2014-01-02 00:00:00  2014-01-01 00:00:00    3
I want to join consecutive stays by the same user that are separated by 1 day or less, effectively creating the following DataFrame:
                   end                start  uid
0  2014-01-04 00:00:00  2014-01-01 00:00:00    1
2  2014-02-02 00:00:00  2014-02-01 00:00:00    1
3  2014-01-02 00:00:00  2014-01-01 00:00:00    3
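The first step would be groupby("uid"). But how do I iterate over the rows of each group so that I can do this joining with the pandas toolbox?
For convenience, here is a minimal initialization of the DataFrame: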
import pandas as pd
from datetime import datetime
data = pd.DataFrame([
    {"uid": 1, "start": datetime(2014, 1, 1), "end": datetime(2014, 1, 2)},
    {"uid": 1, "start": datetime(2014, 1, 2), "end": datetime(2014, 1, 4)},
    {"uid": 1, "start": datetime(2014, 2, 1), "end": datetime(2014, 2, 2)},
    {"uid": 3, "start": datetime(2014, 1, 1), "end": datetime(2014, 1, 2)},
])
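Answer 0 (score: 0)
So, this is how I solved it, without using any vectorization or special pandas features. Note that this assumes the data is sorted by start, then end.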
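data["discard"] = False
maxdiff = 24 * 60 * 60  # largest allowed gap between two stays, in seconds (1 day)
# Column positions, so rows can be updated with a single .iloc call
start_col = data.columns.get_loc("start")
discard_col = data.columns.get_loc("discard")
parts = []
for uid, group in data.groupby("uid"):
    group = group.copy()  # work on a copy so the assignments below take effect
    for x in range(1, len(group)):
        prev = group["end"].iloc[x - 1]
        curr = group["start"].iloc[x]
        difference = (curr - prev).total_seconds()
        if difference <= maxdiff:  # "1 day or less", so the bound is inclusive
            # Extend the current stay back to the previous start and
            # mark the previous row for removal
            group.iloc[x, start_col] = group.iloc[x - 1, start_col]
            group.iloc[x - 1, discard_col] = True
    parts.append(group)
new_df = pd.concat(parts)
new_df = new_df[~new_df["discard"]].drop(columns="discard")
print(new_df)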
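For comparison, here is a rough sketch of how the same merge might be done with vectorized pandas operations instead of an explicit loop. The function name merge_stays and the max_gap parameter are made up for illustration and are not part of the original answer. The sketch sorts the data itself and assumes that a user's stays are never nested inside one another, since it only compares each stay with the one directly before it.

```python
import pandas as pd


def merge_stays(df, max_gap=pd.Timedelta(days=1)):
    """Merge consecutive stays of the same uid separated by at most max_gap.

    Hypothetical helper: assumes columns "uid", "start" and "end", and that
    a user's stays are not nested inside each other.
    """
    df = df.sort_values(["uid", "start", "end"])
    # End of the previous stay of the same user (NaT for a user's first stay)
    prev_end = df.groupby("uid")["end"].shift()
    # A new block begins at a user's first stay or when the gap is too large
    new_block = prev_end.isna() | (df["start"] - prev_end > max_gap)
    # Running count of block starts gives every merged stay its own label
    block = new_block.cumsum().rename("block")
    return (
        df.groupby(["uid", block], sort=False)
        .agg(start=("start", "min"), end=("end", "max"))
        .reset_index()
        .drop(columns="block")
    )
```

Calling merge_stays(data) on the sample DataFrame from the question should return three rows, with the first two stays of uid 1 combined into a single stay from 2014-01-01 to 2014-01-04. Unlike the loop version, the result gets a fresh 0..n-1 index rather than keeping the original row labels.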