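I want to compute a running sum over a column, but reset the sum every time it reaches a certain value.
I have read several questions about conditionally resetting a cumsum. They all rely on some other column that holds a "reset value".
I am using geopy's distance function to compute the distance from the first point (row 0) to every other point: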
              lat         lng  all_distances
    0   39.984198  116.319322       0.000000
    12  39.984611  116.319822      62.663690
    24  39.984252  116.320826     128.601760
    36  39.983916  116.320980     145.036185
    48  39.982688  116.321225     233.518640
    60  39.981441  116.321305     349.856365
    72  39.980291  116.321430     469.693983
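What I actually want is to accumulate the distance until it reaches 200, and then start over, with the next point taking the place of the "first" point.
Here is a runnable MCVE, so that its timing can be compared against a vectorized version: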
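To make the intended behaviour concrete, here is a minimal sketch in plain Python. It rests on two assumptions that are mine and not part of the setup above: the distance is approximated with the haversine formula on a sphere (geopy's `distance` is geodesic, so values differ slightly), and the point that crosses the threshold becomes the new "first" point. The names `haversine_m` and `distances_with_reset` are made up for illustration.

```python
import math

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres on a sphere (approximation, not geopy)."""
    r = 6371008.8  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def distances_with_reset(points, limit=200.0):
    """Distance of each point from the current anchor; the anchor moves
    to any point whose distance exceeds `limit` (assumed reset rule)."""
    anchor = 0
    dists, resets = [], []
    for i, (lat, lng) in enumerate(points):
        d = haversine_m(points[anchor][0], points[anchor][1], lat, lng)
        dists.append(d)
        if d > limit:
            resets.append(i)  # row i crossed the threshold
            anchor = i        # and becomes the new "first" point
    return dists, resets

points = [(39.984198, 116.319322), (39.984611, 116.319822),
          (39.984252, 116.320826), (39.983916, 116.320980),
          (39.982688, 116.321225), (39.981441, 116.321305),
          (39.980291, 116.321430), (39.979675, 116.321805),
          (39.979546, 116.322926), (39.979758, 116.324513)]

dists, resets = distances_with_reset(points)
print(resets)
```

With the ten sample points used in this question, `resets` lists the positional row numbers at which the distance from the current anchor first exceeds 200 m.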
    import pandas as pd
    from geopy.distance import distance

    print(pd.__version__)

    data = [[39.984198, 116.319322],
            [39.984611, 116.319822],
            [39.984252, 116.320826],
            [39.983916, 116.32098],
            [39.982688, 116.321225],
            [39.981441, 116.321305],
            [39.980291, 116.32143],
            [39.979675, 116.321805],
            [39.979546, 116.322926],
            [39.979758, 116.324513]]
    user_gps_log = pd.DataFrame(data, columns=['lat', 'lng'])

    # Distance in metres from every row to the first point (row 0)
    first_lat = user_gps_log.iloc[0].lat
    first_lng = user_gps_log.iloc[0].lng
    all_distances = user_gps_log.apply(
        lambda x: distance((x.lat, x.lng), (first_lat, first_lng)).m, axis=1)
    user_gps_log['all_distances'] = all_distances

    p = user_gps_log
    i = 0
    dist_thres = 200  # metres, the threshold described above
    while i < len(p):
        j = i + 1
        while j < len(p):
            dist = distance((p.iloc[i].lat, p.iloc[i].lng),
                            (p.iloc[j].lat, p.iloc[j].lng)).m
            if dist > dist_thres:
                # do stuff
                i = j  # the point that crossed the threshold becomes the new "first" point
                token = 1
                break
            j = j + 1
        else:
            # No remaining point is farther than dist_thres from point i:
            # stop here, otherwise the outer loop would never terminate.
            break
Edit / update
I tried an njit implementation (I could not avoid iterating..):
    from numba import njit

    @njit
    def cumsum_distance(lat, lng, limit=200):
        running_distance = 0
        first = (lat[0], lng[0])
        for i in range(lat.shape[0]):
            # geopy's distance is called inside the jitted function
            dist = distance(first, (lat[i], lng[i])).m
            running_distance += dist
            if running_distance > limit:
                yield i, running_distance
                running_distance = 0

    running_distances = cumsum_distance(user_gps_log.lat.values, user_gps_log.lng.values, 200)
This raises the following error:
    TypingError: Failed in nopython mode pipeline (step: nopython frontend)
    Untyped global name 'distance': cannot determine Numba type of <class 'type'>

    File "<ipython-input-194-7214618c7e64>", line 6:
    def cumsum_distance(lat, lng, limit=200):
        <source elided>
        for i in range(lat.shape[0]):
            dist = distance(first, (lat[i], lng[i])).m
            ^

    This is not usually a problem with Numba itself but instead often caused by
    the use of unsupported features or an issue in resolving types.
Is this because I am using geopy's distance function? Do I need to register a "type", the same way as when using a UDAF in PySpark?
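For what it is worth, my current guess is that Numba cannot type `distance` because it is an ordinary Python class from geopy, and that the jitted function would have to do the distance math itself. Below is a rough sketch of what I imagine that could look like. It assumes a haversine approximation in place of geopy's geodesic distance, returns the reset rows as a list instead of yielding them, and uses a hypothetical helper name `reset_indices`. The `try`/`except` only lets the sketch run when Numba is not installed. I pass tuples here to keep it dependency-free; in my real code these would be the `.values` arrays.

```python
import math

try:
    from numba import njit  # optional; only needed for compilation
except ImportError:
    def njit(func):
        # Fallback: run as plain Python when Numba is unavailable
        return func

@njit
def haversine_m(lat1, lng1, lat2, lng2):
    # Spherical approximation in metres; not geopy's geodesic distance
    r = 6371008.8
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

@njit
def reset_indices(lat, lng, limit):
    # Rows whose distance from the current anchor exceeds `limit`;
    # each such row becomes the new anchor (assumed reset rule).
    out = []
    anchor = 0
    for i in range(len(lat)):
        d = haversine_m(lat[anchor], lng[anchor], lat[i], lng[i])
        if d > limit:
            out.append(i)
            anchor = i
    return out

lat = (39.984198, 39.984611, 39.984252, 39.983916, 39.982688,
       39.981441, 39.980291, 39.979675, 39.979546, 39.979758)
lng = (116.319322, 116.319822, 116.320826, 116.320980, 116.321225,
       116.321305, 116.321430, 116.321805, 116.322926, 116.324513)

resets = reset_indices(lat, lng, 200.0)
print(resets)
```

Whether this is the right approach, or whether geopy can be made to work under `@njit` at all, is exactly what I am unsure about.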