Question

I have a Pandas DataFrame whose rows correspond to events and whose columns hold each event's time, latitude, and longitude. It looks like this:

     time                      latitude   longitude
0    1994-03-01 03:49:00.830    49.096     32.617 . . .
1    1994-10-04 11:41:28.080    10.964    133.891 . . .
2    1995-06-02 03:38:03.890    19.803    -52.799 . . .
3    1995-08-21 19:17:15.300   -19.851   -175.043 . . .
.
.
.

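What I want to do is group the events in this dataset so that each event is grouped with every event that falls within a given time and distance of it, timedif and spacedif.

For example, say timedif is 1 year (ignoring the other variable). Then I want a group for event 0 above that contains event 1 but not event 2, and event 1 should not get a group of its own because it is already in group 0. Then there is a second group for event 2 that contains event 3, and so on.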
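
To make the intended result concrete, here is a minimal sketch of the time-only case on the four sample rows shown earlier. It assumes "1 year" means 365 days and that the rows are already in chronological order; the names `groups` and `head` are illustrative only.

```python
import pandas as pd

# The four sample timestamps from the table above.
times = pd.to_datetime(pd.Series([
    "1994-03-01 03:49:00.830",
    "1994-10-04 11:41:28.080",
    "1995-06-02 03:38:03.890",
    "1995-08-21 19:17:15.300",
]))
timedif = pd.Timedelta(days=365)  # assumption: "1 year" taken as 365 days

groups = {}   # head position -> positions of the other events in that group
head = None   # position of the most recent group head
for i, t in enumerate(times):
    # With time as the only criterion and chronological rows, only the most
    # recent head can still be within range of the current event.
    if head is not None and t - times.iloc[head] < timedif:
        groups[head].append(i)
    else:
        head = i
        groups[head] = []

print(groups)  # {0: [1], 2: [3]}
```

Once the distance criterion is added, several heads can be within the time window at once, so this single-head shortcut no longer applies.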

What I am currently trying is very inefficient:

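# dfog holds the original events; dfbuild holds one row per group
# (DataFrame.append only exists in pandas < 2.0)
dfbuild = dfbuild.append({'head index': 0, 'sub index': []}, ignore_index=True)
head = -1  # -1 means no existing group matched the current event
for i in dfog.index:
    for j in dfbuild.index:
        # compare event i with the head event of group j: time first, then distance
        if(timecomp(dfog.loc[dfbuild.loc[j]['head index']]['time'],dfog.loc[i]['time']) < timedif ):
            if(geopy.distance.distance( (dfog.loc[i]['latitude'],dfog.loc[i]['longitude']),(dfog.loc[dfbuild.loc[j]['head index']]['latitude'],dfog.loc[dfbuild.loc[j]['head index']]['longitude']) ).km < spacedif ):
                head = j
                break
    if(head == -1):
        # no group matched, so event i becomes the head of a new group
        dfbuild = dfbuild.append({'head index': i, 'sub index': []}, ignore_index=True)
    else:
        dfbuild.loc[head]['sub index'].append(i)
    head = -1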

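(timecomp just uses datetime to convert the strings to datetimes and then subtracts them; I am using the geopy.distance.distance() function to get the distance between the latitude/longitude pairs.)

I know this is ugly, and I think I am misusing .loc, but it does work; I end up with a DataFrame with two columns, one holding the head index values and one holding all the corresponding sub index values. But it is very slow, and it gets disproportionately slower as the dataset grows.

How can I speed this up? I am also not attached to doing it this way, so if I should scrap it entirely and take a different approach, that is an option.

Note that the rows in the dataset are in chronological order.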

Answer 1

Try using Geopandas (http://geopandas.org). For grouping by time, something like:

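import pandas as pd
times = pd.to_datetime(dfog.time)  # dfog is the event DataFrame from the question
# datetime fields of a Series are reached through the .dt accessor
dfog.groupby([times.dt.hour, times.dt.minute]).count()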
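
Grouping by clock hour and minute does not by itself give the "first head within timedif and spacedif" grouping the question describes. A rough sketch of one way to do that faster is shown here, under stated assumptions: the function names `group_events` and `haversine_km` are hypothetical, the rows are sorted by time, and geopy's geodesic distance is swapped for a haversine great-circle distance on a spherical Earth, which is only an approximation. Because the rows are chronological, heads that have fallen out of the time window can be skipped permanently, and the distance to all remaining heads is computed in one vectorized NumPy call.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in degrees, NumPy-broadcastable."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

def group_events(df, timedif, spacedif_km):
    """Greedy grouping of time-sorted events.

    Returns {head position: [member positions]}, using row positions
    (0..n-1) rather than index labels. A head is not listed as its own member.
    """
    t = pd.to_datetime(df["time"]).to_numpy()
    lat = df["latitude"].to_numpy(dtype=float)
    lon = df["longitude"].to_numpy(dtype=float)
    limit = pd.Timedelta(timedif).to_timedelta64()

    heads = []   # positions of group heads, in chronological order
    first = 0    # heads[:first] are too old to match any later event
    groups = {}
    for i in range(len(df)):
        # Skip heads that have left the time window for good.
        while first < len(heads) and t[i] - t[heads[first]] >= limit:
            first += 1
        active = np.array(heads[first:], dtype=int)
        if active.size:
            d = haversine_km(lat[i], lon[i], lat[active], lon[active])
            hits = np.flatnonzero(d < spacedif_km)
            if hits.size:
                # Join the earliest head that is close enough.
                groups[int(active[hits[0]])].append(i)
                continue
        heads.append(i)   # no head matched: event i starts a new group
        groups[i] = []
    return groups

# The four sample rows from the question.
df = pd.DataFrame({
    "time": ["1994-03-01 03:49:00.830", "1994-10-04 11:41:28.080",
             "1995-06-02 03:38:03.890", "1995-08-21 19:17:15.300"],
    "latitude": [49.096, 10.964, 19.803, -19.851],
    "longitude": [32.617, 133.891, -52.799, -175.043],
})
# A distance limit larger than half the Earth's circumference disables the
# spatial criterion, which reproduces the time-only example in the question.
groups = group_events(df, "365D", spacedif_km=25000)
print(groups)  # {0: [1], 2: [3]}
```

The outer loop is still plain Python, so this is a sketch of the idea rather than a tuned implementation. If exact geodesic distances matter, the haversine result could serve as a cheap pre-filter before calling geopy on the few remaining candidates.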
