I have a CSV dataset that looks like this:
created_date,latitude,longitude
"2018-10-02 16:52:54",20.56314546,-100.40871983
"2018-10-07 18:06:37",20.56899227,-100.40879701
"2018-10-08 11:55:31",20.57479211,-100.39687493
"2018-10-08 11:55:31",20.58076244,-100.36075875
"2018-10-08 11:55:31",20.60529101,-100.40951731
"2018-10-08 11:55:31",20.60783806,-100.37852743
"2018-10-09 18:10:00",20.61098901,-100.38008197
"2018-10-09 18:10:00",20.61148848,-100.40851908
"2018-10-09 18:10:00",20.61327334,-100.34415272
"2018-10-09 18:10:00",20.61397514,-100.33583425
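I'm trying to use pandas to split the data into groups by date, then iterate over each group and compute the distance between the latitude/longitude points in it, using a haversine function that takes two coordinates as arguments.
To do this, I have to compute the distance from coord 1 to coord 2, from coord 2 to coord 3, and so on (within each group).
I want to do this to work out the average distance travelled, so I then have to add up the distances and divide by the number of groups.
With pandas I've managed to split the data into groups, but I'm not sure how to iterate over them while excluding groups that don't have two coordinates to compute a distance from (for example "2018-10-02 16:52:54").
My current Python script is as follows: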
import pandas as pd

col_names = ['date', 'latitude', 'longitude']
data = pd.read_csv('dataset.csv', names=col_names, sep=',', skiprows=1)
grouped = data.groupby('date')
for index, item in grouped:
    ...  # this is where I'm stuck
Any guidance is appreciated. I have a rough idea of how to go about it, but I'm not sure whether a tool like zip could help me here.
Answer 0 (score: 1)
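Here is one option. It performs a large merge within each group, which gives all pairwise combinations. After removing the rows where a point was merged with itself, the distances can be calculated in a single vectorized call.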
import pandas as pd
import numpy as np
def haversine(lon1, lat1, lon2, lat2):
# convert degrees to radians
lon1 = np.deg2rad(lon1)
lat1 = np.deg2rad(lat1)
lon2 = np.deg2rad(lon2)
lat2 = np.deg2rad(lat2)
# formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
c = 2 * np.arcsin(np.sqrt(a))
r_e = 6371
return c * r_e
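# merge (df is assumed to be the CSV loaded with its own header row, so the date column is named created_date)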
m = df.reset_index().merge(df.reset_index(), on='created_date')
# remove comparisons of the same event
m = m[m.index_x != m.index_y].drop(columns = ['index_x', 'index_y'])
# Calculate Distance
m['Distance'] = haversine(m.longitude_x, m.latitude_x, m.longitude_y, m.latitude_y)
m
created_date latitude_x longitude_x latitude_y longitude_y Distance
3 2018-10-08 11:55:31 20.574792 -100.396875 20.580762 -100.360759 3.817865
4 2018-10-08 11:55:31 20.574792 -100.396875 20.605291 -100.409517 3.637698
5 2018-10-08 11:55:31 20.574792 -100.396875 20.607838 -100.378527 4.141211
...
30 2018-10-09 18:10:00 20.613975 -100.335834 20.610989 -100.380082 4.617105
31 2018-10-09 18:10:00 20.613975 -100.335834 20.611488 -100.408519 7.569825
32 2018-10-09 18:10:00 20.613975 -100.335834 20.613273 -100.344153 0.869261
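Note that the `index_x != index_y` filter keeps every pair in both orders, (a, b) and (b, a). That does not change a mean, but it doubles a sum. The following is a minimal sketch on made-up toy data (the `t1`/`t2` labels and coordinates are purely illustrative, not from the dataset) showing the difference between `!=` and `<` on the merged index columns:

```python
import pandas as pd

# Toy data (invented for illustration): three points share one timestamp, one stands alone.
df = pd.DataFrame({
    "created_date": ["t1", "t1", "t1", "t2"],
    "latitude": [20.0, 20.1, 20.2, 20.3],
    "longitude": [-100.0, -100.1, -100.2, -100.3],
})

left = df.reset_index()                    # keeps the row label in a column named "index"
m = left.merge(left, on="created_date")    # all pairwise combinations within each timestamp

both_orders = m[m.index_x != m.index_y]    # (a, b) and (b, a) are both kept
once_each = m[m.index_x < m.index_y]       # each unordered pair is kept once

print(len(both_orders), len(once_each))
```

With three points in `t1` this gives six ordered pairs but only three unordered ones, and the lone `t2` point drops out in both cases. If you intend to add distances up rather than average them, the `<` variant is probably the one you want.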
To get the average for each date:
m.groupby('created_date').Distance.mean()
#created_date
#2018-10-08 11:55:31 4.021623
#2018-10-09 18:10:00 4.411060
#Name: Distance, dtype: float64
Because we subset the merged DataFrame earlier, this only produces output for created_date values that have more than one measurement.
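If you would rather drop the single-point groups explicitly before doing anything else, one possible way is to compare each row's group size against 2. This is only a sketch; the three rows are copied from the question's data and the names `group_size` and `usable` are my own:

```python
import pandas as pd

# Three rows taken from the question's dataset: one lone timestamp, one shared by two points.
df = pd.DataFrame({
    "created_date": ["2018-10-02 16:52:54", "2018-10-08 11:55:31", "2018-10-08 11:55:31"],
    "latitude": [20.56314546, 20.57479211, 20.58076244],
    "longitude": [-100.40871983, -100.39687493, -100.36075875],
})

# Size of the group each row belongs to, aligned with df's index.
group_size = df.groupby("created_date")["created_date"].transform("size")

# Keep only rows whose timestamp has at least two coordinates.
usable = df[group_size >= 2]
print(usable)
```

Here the "2018-10-02 16:52:54" row is removed because nothing else shares its timestamp.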
To merge on the date rather than the exact time:
df['created_date'] = pd.to_datetime(df.created_date)
df['ng'] = df.groupby(df.created_date.dt.date).ngroup()
m = df.reset_index().merge(df.reset_index(), on='ng')
m = m[m.index_x != m.index_y].drop(columns = ['index_x', 'index_y'])
...
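The merge above compares every point with every other point in its group. The question actually asks for consecutive pairs (coord 1 to coord 2, coord 2 to coord 3, and so on), so here is a separate sketch of that reading using `groupby(...).shift()` instead of a merge. It is not the approach of this answer. It assumes rows are already in travel order within each day, groups by calendar date, and interprets "average distance" as the mean of the per-day totals over days with at least two points. The column `step_km` and the variables `per_day` and `average_km` are names I made up.

```python
import numpy as np
import pandas as pd

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in km (same formula as above, Earth radius 6371 km)."""
    lon1, lat1, lon2, lat2 = map(np.deg2rad, (lon1, lat1, lon2, lat2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# First four rows of the question's dataset.
df = pd.DataFrame({
    "created_date": pd.to_datetime([
        "2018-10-02 16:52:54",
        "2018-10-08 11:55:31", "2018-10-08 11:55:31", "2018-10-08 11:55:31",
    ]),
    "latitude": [20.56314546, 20.57479211, 20.58076244, 20.60529101],
    "longitude": [-100.40871983, -100.39687493, -100.36075875, -100.40951731],
})

day = df["created_date"].dt.date           # group by calendar date, not exact time
g = df.groupby(day)

# Previous point within the same day; NaN for the first point of each day.
prev_lat = g["latitude"].shift()
prev_lon = g["longitude"].shift()
df["step_km"] = haversine(prev_lon, prev_lat, df["longitude"], df["latitude"])

# Total per day; min_count=1 leaves days with no valid step as NaN so they can be dropped.
per_day = df.groupby(day)["step_km"].sum(min_count=1).dropna()
average_km = per_day.mean()
print(per_day)
print(average_km)
```

Days with a single point produce no step at all, so they fall out of `per_day` without any special handling.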