Python: Splitting a trajectory into steps

Asked: 2017-04-09 03:50:16

Tags: python pandas graph networkx

I have trajectories created from movements between clusters:

user_id,trajectory
11011,[[86], [110], [110]]
2139671,[[89], [125]]
3945641,[[36], [73], [110], [110]]
10024312,[[123], [27], [97], [97], [97], [110]]
14270422,[[0], [110], [174]]
14283758,[[110], [184]]
14317445,[[50], [88]]
14331818,[[0], [22], [36], [131], [131]]
14334591,[[107], [19]]
14373703,[[35], [97], [97], [97], [17], [58]]

I want to split trajectories with multiple moves into separate segments, but I'm not sure how.

Example:

14373703,[[35], [97], [97], [97], [17], [58]]

into

14373703,[[35,97], [97,97], [97,17], [17,58]]

The goal is to use these as edges in NetworkX, analyze them as a graph, and identify dense movement (edges) between the individual clusters (nodes).
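Not part of the question, but a minimal sketch of that end goal: each `[u, v]` segment becomes a weighted directed edge, so repeated movements between the same clusters accumulate into edge weights, and "dense movement" is simply the high-weight edges. The `edges` list here is hypothetical toy data.

```python
import networkx as nx

# Hypothetical segmented trajectory: [from_cluster, to_cluster] pairs
edges = [[35, 97], [97, 97], [97, 17], [17, 58], [35, 97]]

G = nx.DiGraph()
for u, v in edges:
    if G.has_edge(u, v):
        # Repeated movement between the same clusters: bump the weight
        G[u][v]['weight'] += 1
    else:
        G.add_edge(u, v, weight=1)

# Dense movements are the edges traversed more than once
dense = [(u, v, d['weight']) for u, v, d in G.edges(data=True) if d['weight'] > 1]
print(dense)  # [(35, 97, 2)]
```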

Here is the code I use to create the trajectories:

# Import Data
data = pd.read_csv(r'G:\Programming Projects\GGS 681\dmv_tweets_20170309_20170314_cluster_outputs.csv', delimiter=',', engine='python')
#print len(data),"rows"

# Create Data Frame
df = pd.DataFrame(data, columns=['user_id','timestamp','latitude','longitude','cluster_labels'])

# Filter Data Frame by count of user_id
filtered = df.groupby('user_id').filter(lambda x: x['user_id'].count()>1)
#filtered.to_csv(r'G:\Programming Projects\GGS 681\dmv_tweets_20170309_20170314_final_filtered.csv', index=False, header=True)

# Get a list of unique user_id values
uniqueIds = np.unique(filtered['user_id'].values)

# Get the ordered (by timestamp) coordinates for each user_id
output = [[id,filtered.loc[filtered['user_id']==id].sort_values(by='timestamp')[['cluster_labels']].values.tolist()] for id in uniqueIds]

# Save outputs as csv
outputs = pd.DataFrame(output)
#print outputs
headers = ['user_id','trajectory']
outputs.to_csv(r'G:\Programming Projects\GGS 681\dmv_tweets_20170309_20170314_cluster_moves.csv', index=False, header=headers)

If they can be split this way, can it be done during processing rather than afterwards? I'd like to perform it at creation time to eliminate any post-processing.

3 Answers:

Answer 0 (score: 2)

My solution uses the magic of pandas' `.apply()` function. I believe this should work (I tested it on your sample data). Note that I also added extra data points at the end to cover the cases of a single move and no moves at all.

# Python3.5
import pandas as pd 


# Sample data from post
ids = [11011,2139671,3945641,10024312,14270422,14283758,14317445,14331818,14334591,14373703,10000,100001]
traj = [[[86], [110], [110]],[[89], [125]],[[36], [73], [110], [110]],[[123], [27], [97], [97], [97], [110]],[[0], [110], [174]],[[110], [184]],[[50], [88]],[[0], [22], [36], [131], [131]],[[107], [19]],[[35], [97], [97], [97], [17], [58]],[10],[]]

# Sample frame
df = pd.DataFrame({'user_ids':ids, 'trajectory':traj})

def f(x):
    # Creates edges given list of moves
    if len(x) <= 1: return x
    s = [x[i]+x[i+1] for i in range(len(x)-1)]
    return s

df['edges'] = df['trajectory'].apply(lambda x: f(x))

Output:

print(df['edges'])

                                                edges  
0                             [[86, 110], [110, 110]]  
1                                         [[89, 125]]  
2                   [[36, 73], [73, 110], [110, 110]]  
3   [[123, 27], [27, 97], [97, 97], [97, 97], [97,...  
4                              [[0, 110], [110, 174]]  
5                                        [[110, 184]]  
6                                          [[50, 88]]  
7          [[0, 22], [22, 36], [36, 131], [131, 131]]  
8                                         [[107, 19]]  
9   [[35, 97], [97, 97], [97, 97], [97, 17], [17, ...  
10                                               [10]  
11                                                 []

As for where this fits in your pipeline: just do it right after you obtain the `trajectory` column (whether that's right after loading the data, or after any filtering you perform).
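To do it at creation time, as the question asks, the same function can be called inside the trajectory-building comprehension so the edges come out directly, with no post-processing step. A minimal sketch using the question's column names on a small hypothetical frame:

```python
import pandas as pd

def f(x):
    # Pairwise edges from a list of single-element moves
    if len(x) <= 1:
        return x
    return [x[i] + x[i + 1] for i in range(len(x) - 1)]

# Hypothetical stand-in for the filtered frame from the question
df = pd.DataFrame({
    'user_id': [14373703] * 3,
    'timestamp': [1, 2, 3],
    'cluster_labels': [35, 97, 17],
})

# Build edges directly while grouping, instead of afterwards
output = [
    [uid, f(g.sort_values('timestamp')[['cluster_labels']].values.tolist())]
    for uid, g in df.groupby('user_id')
]
print(output)  # [[14373703, [[35, 97], [97, 17]]]]
```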

Answer 1 (score: 2)

If you `zip` your trajectory with itself offset by one, you get the result you want.

Code:

for id, traj in data.items():
    print(id, list([i[0], j[0]] for i, j in zip(traj[:-1], traj[1:])))

Test Data:

data = {
    11011: [[86], [110], [110]],
    2139671: [[89], [125]],
    3945641: [[36], [73], [110], [110]],
    10024312: [[123], [27], [97], [97], [97], [110]],
    14270422: [[0], [110], [174]],
    14283758: [[110], [184]],
    14373703: [[35], [97], [97], [97], [17], [58]],
}

Results:

11011 [[86, 110], [110, 110]]
14373703 [[35, 97], [97, 97], [97, 97], [97, 17], [17, 58]]
3945641 [[36, 73], [73, 110], [110, 110]]
14283758 [[110, 184]]
14270422 [[0, 110], [110, 174]]
2139671 [[89, 125]]
10024312 [[123, 27], [27, 97], [97, 97], [97, 97], [97, 110]]

Answer 2 (score: 2)

I think you can use `groupby` with `apply` and a custom function with `zip` in a list comprehension to get the output lists of lists:

Note:

The `count` function returns the number of non-NaN values only; if you want to filter by length regardless of NaN, `len` is better:

#filtering and sorting     
filtered = df.groupby('user_id').filter(lambda x: len(x['user_id'])>1)
filtered = filtered.sort_values(by='timestamp')

f = lambda x: [list(a) for a in zip(x[:-1], x[1:])]
df2 = filtered.groupby('user_id')['cluster_labels'].apply(f).reset_index()
print (df2)
    user_id                                     cluster_labels
0     11011                            [[86, 110], [110, 110]]
1   2139671                                        [[89, 125]]
2   3945641                  [[36, 73], [73, 110], [110, 110]]
3  10024312  [[123, 27], [27, 97], [97, 97], [97, 97], [97,...
4  14270422                             [[0, 110], [110, 174]]
5  14283758                                       [[110, 184]]
6  14373703  [[35, 97], [97, 97], [97, 97], [97, 17], [17, ...

A similar solution, with the filtering done as a last step using boolean indexing:

filtered = filtered.sort_values(by='timestamp')

f = lambda x: [list(a) for a in zip(x[:-1], x[1:])]
df2 = filtered.groupby('user_id')['cluster_labels'].apply(f).reset_index()
df2 = df2[df2['cluster_labels'].str.len() > 0]
print (df2)
    user_id                                     cluster_labels
1     11011                            [[86, 110], [110, 110]]
2   2139671                                        [[89, 125]]
3   3945641                  [[36, 73], [73, 110], [110, 110]]
4  10024312  [[123, 27], [27, 97], [97, 97], [97, 97], [97,...
5  14270422                             [[0, 110], [110, 174]]
6  14283758                                       [[110, 184]]
7  14373703  [[35, 97], [97, 97], [97, 97], [97, 17], [17, ...
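The `.str.len() > 0` filter above works because `.str.len()` also operates on list-valued cells, and a user with a single point produces an empty edge list. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical edge lists: the middle user had only one point, so no edges
s = pd.Series([[[86, 110]], [], [[35, 97], [97, 17]]])

# .str.len() returns the number of edges per user, even for list values
mask = s.str.len() > 0
print(mask.tolist())       # [True, False, True]
print(s[mask].tolist())    # the two non-empty edge lists
```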