I have a file containing records of Wi-Fi access. The database has several columns: user_id, start (when the device connected to the router), and end (when the device disconnected from the router).
Example:
user_id start end
1 15/05/16 13:51 15/05/16 14:06
1 15/05/16 14:06 15/05/16 14:32
1 15/05/16 14:32 15/05/16 14:34
2 15/05/16 11:14 15/05/16 11:25
2 15/05/16 11:25 15/05/16 12:09
2 15/05/16 12:14 15/05/16 12:42
2 15/05/16 17:33 15/05/16 17:41
2 15/05/16 17:41 15/05/16 18:27
The problem is that sometimes a device disconnects and reconnects right away. I would like to group the data across such events:
user_id start end
1 15/05/16 13:51 15/05/16 14:34
2 15/05/16 11:14 15/05/16 12:42
2 15/05/16 17:33 15/05/16 18:27
Is there an efficient way to do this with pandas?
Answer 0 (score: 0)
We can do something like this:
import pandas as pd
from datetime import datetime

data = pd.DataFrame(
    [
        [1, '15/05/16 13:51', '15/05/16 14:06'],
        [1, '15/05/16 14:06', '15/05/16 14:32'],
        [1, '15/05/16 14:32', '15/05/16 14:34'],
        [2, '15/05/16 11:14', '15/05/16 11:25'],
        [2, '15/05/16 11:25', '15/05/16 12:09'],
        [2, '15/05/16 12:14', '15/05/16 12:42'],
        [2, '15/05/16 17:33', '15/05/16 17:41'],
        [2, '15/05/16 17:41', '15/05/16 18:27']
    ],
    columns=['userid', 'start', 'end']
)

# Parse the timestamp strings (day/month/year format)
data['start'] = data['start'].map(lambda x: datetime.strptime(x, '%d/%m/%y %H:%M'))
data['end'] = data['end'].map(lambda x: datetime.strptime(x, '%d/%m/%y %H:%M'))

# Gap in minutes between each row's start and the previous row's end
diffData = []
for i in range(1, len(data)):
    diffData.append((data.loc[i, 'start'] - data.loc[i - 1, 'end']).seconds / 60)
data['diff'] = [0] + diffData

def getStartEnd(tempData, THRESHOLD):
    tempData = tempData.reset_index()
    finalData = []
    startTime = tempData.loc[0, 'start']
    for i in range(1, len(tempData)):
        # A gap above THRESHOLD minutes closes the current session
        if tempData.loc[i, 'diff'] > THRESHOLD:
            finalData.append([tempData.loc[i, 'userid'], startTime, tempData.loc[i - 1, 'end']])
            startTime = tempData.loc[i, 'start']
    # Close the last open session
    finalData.append([tempData.loc[i, 'userid'], startTime, tempData.loc[i, 'end']])
    return pd.DataFrame(finalData, columns=['userid', 'start', 'end'])

finalData = pd.DataFrame(columns=['userid', 'start', 'end'])
for user in data['userid'].unique():
    finalData = pd.concat([finalData, getStartEnd(data[data['userid'] == user], 60)])
print(finalData)
userid start end
0 1 2016-05-15 13:51:00 2016-05-15 14:34:00
0 2 2016-05-15 11:14:00 2016-05-15 12:42:00
1 2 2016-05-15 17:33:00 2016-05-15 18:27:00
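As a side note, the same per-user session grouping can be done without explicit Python loops, using `shift`, `cumsum`, and a single `groupby`/`agg`. This is an illustrative vectorized sketch (the 60-minute threshold and column names follow the answer above), not the answer's original code:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, '15/05/16 13:51', '15/05/16 14:06'],
     [1, '15/05/16 14:06', '15/05/16 14:32'],
     [1, '15/05/16 14:32', '15/05/16 14:34'],
     [2, '15/05/16 11:14', '15/05/16 11:25'],
     [2, '15/05/16 11:25', '15/05/16 12:09'],
     [2, '15/05/16 12:14', '15/05/16 12:42'],
     [2, '15/05/16 17:33', '15/05/16 17:41'],
     [2, '15/05/16 17:41', '15/05/16 18:27']],
    columns=['userid', 'start', 'end'])
df['start'] = pd.to_datetime(df['start'], format='%d/%m/%y %H:%M')
df['end'] = pd.to_datetime(df['end'], format='%d/%m/%y %H:%M')

THRESHOLD = pd.Timedelta(minutes=60)
# Gap between this row's start and the previous row's end, per user
gap = df['start'] - df.groupby('userid')['end'].shift()
# A new session starts when the gap exceeds the threshold (NaT marks a new user)
session = (gap.isna() | (gap > THRESHOLD)).cumsum()
out = df.groupby(session).agg({'userid': 'first', 'start': 'first', 'end': 'last'})
print(out.reset_index(drop=True))
```

This assumes, like the answer, that the data is sorted by user and then by start time.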
Answer 1 (score: 0)
You can use the pandas groupby function on the user id, and within each user's data compute the difference between each start and the previous end. Then apply a cumulative sum over the individual groups, after which you can extract the start of the first row and the end of the last row of each group :-)
def func(threshold, df1):
    # Difference (in seconds) between each row's start and the previous row's end
    df1['diff1'] = ((df1.start - df1.end.shift()).dt.seconds).fillna(0)
    # If the difference is below the threshold, treat it as no gap
    df1.loc[df1['diff1'] < threshold, 'diff1'] = 0
    # The cumulative sum labels each run of consecutive rows with the same value
    df1['diff1'] = df1.diff1.cumsum()
    # Group by the cumulative sum and keep each group's first row, with its
    # end replaced by the last row's end (set_value was removed in pandas 1.0,
    # so assign is used here instead)
    df1 = df1.groupby(['diff1']).apply(
        lambda x: x.assign(end=x['end'].iloc[-1]).iloc[0])
    return df1
data.start = pd.to_datetime(data.start)
data.end = pd.to_datetime(data.end)
# Threshold for starting a new session (in seconds)
threshold = 500
# Calling the function for each user id
data.groupby('userid').apply(lambda x: func(threshold, x))
Out:
userid start end diff1
userid diff1
1 0.0 1 2016-05-15 13:51:00 2016-05-15 14:34:00 0.0
2 0.0 2 2016-05-15 11:14:00 2016-05-15 12:42:00 0.0
2 17460.0 2 2016-05-15 17:33:00 2016-05-15 18:27:00 17460.0
Answer 2 (score: 0)
First, we need the 'start' and 'end' columns in the correct format:
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
Then you need to generate a new column to flag the condition for a distinct connection:
df['id_connection'] = False
The next step is to identify the first observation for each user (it will always be a new connection):
indexes = df.drop_duplicates(subset='user_id', keep='first').index
df.loc[indexes,'id_connection'] = True
Now we need to determine the other condition that produces a new connection. You need to adopt a criterion to decide whether it is a new connection:
import numpy as np

diff_ = (df['start'].values[1:] - df['end'].values[:-1]).astype('float')
time_criteria_mins = 5
new_connection = np.insert((diff_ / (60 * 10**9)) > time_criteria_mins, 0, 1)
Then you need to combine the two conditions: (1) a new user, or (2) a gap between connections of the same user larger than 5 minutes:
df['id_connection'] = (new_connection | df['id_connection']).cumsum()
Finally, we group by the 'id_connection' attribute:
gb = df.groupby('id_connection').agg({'user_id': 'first', 'start': 'first','end':'last'})
Note: be careful to ensure that the dataframe is sorted by (user and start datetime).
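That sort is a one-liner; a minimal sketch with made-up, deliberately out-of-order rows (the column names follow the question, the data is hypothetical):

```python
import pandas as pd

# Hypothetical out-of-order rows, not from the question's dataset
df = pd.DataFrame({
    'user_id': [2, 1, 2],
    'start': pd.to_datetime(
        ['2016-05-15 17:33', '2016-05-15 13:51', '2016-05-15 11:14'])
})
# Sort by user first, then by connection start time, as the method above assumes
df = df.sort_values(['user_id', 'start']).reset_index(drop=True)
print(df)
```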