Pandas cumsum that resets when a certain value is reached

Time: 2019-04-27 13:11:32

Tags: pandas vectorization cumsum

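I want to compute a running total over a column, but reset the total whenever it reaches a certain value.

I have read several questions about conditionally resetting a cumsum. They all rely on some additional column that holds the "reset value".

I am using geopy's distance function to compute each row's distance from the first point (row 0):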

    lat        lng         all_distances
0   39.984198  116.319322  0.000000
12  39.984611  116.319822  62.663690
24  39.984252  116.320826  128.601760
36  39.983916  116.320980  145.036185
48  39.982688  116.321225  233.518640
60  39.981441  116.321305  349.856365
72  39.980291  116.321430  469.693983

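What I want instead is to measure the distance until it reaches 200 m, then start the total again with the next point replacing the "first" point.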
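For reference, the behaviour described above can be sketched in plain Python. This is only one reading of the question: it assumes the point that first exceeds the limit becomes the new reference point, as `i = j` does in the MCVE. `haversine_m` and `distances_with_reset` are hypothetical helpers, and the haversine formula on a spherical Earth stands in for geopy's `distance`, so the values differ slightly from the table.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (spherical approximation)


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lng) pairs in degrees."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    dphi = phi2 - phi1
    dlmb = math.radians(b[1] - a[1])
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def distances_with_reset(points, limit=200.0, dist_fn=haversine_m):
    """Distance of each point from the current reference point.

    When a distance exceeds `limit`, that point becomes the new reference.
    Returns (distances, reset_indices).
    """
    anchor = points[0]
    distances, resets = [], []
    for idx, point in enumerate(points):
        d = dist_fn(anchor, point)
        distances.append(d)
        if d > limit:
            resets.append(idx)  # this row triggered a reset
            anchor = point      # later rows are measured from here
    return distances, resets


# Same coordinates as the MCVE data.
points = [
    (39.984198, 116.319322), (39.984611, 116.319822),
    (39.984252, 116.320826), (39.983916, 116.320980),
    (39.982688, 116.321225), (39.981441, 116.321305),
    (39.980291, 116.321430), (39.979675, 116.321805),
    (39.979546, 116.322926), (39.979758, 116.324513),
]

distances, resets = distances_with_reset(points, limit=200.0)
print(resets)  # [4, 6, 9]
```

To keep geopy's geodesic distances, the same function can take `dist_fn=lambda a, b: distance(a, b).m`. The loop is still sequential, because each reset depends on where the previous one happened.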

Here is a runnable MCVE, so that its timing can be compared against a vectorized version.

import pandas as pd
from geopy.distance import distance
print(pd.__version__)

data = [[ 39.984198, 116.319322],
       [ 39.984611, 116.319822],
       [ 39.984252, 116.320826],
       [ 39.983916, 116.32098 ],
       [ 39.982688, 116.321225],
       [ 39.981441, 116.321305],
       [ 39.980291, 116.32143 ],
       [ 39.979675, 116.321805],
       [ 39.979546, 116.322926],
       [ 39.979758, 116.324513]]

user_gps_log = pd.DataFrame(data, columns=['lat', 'lng'])

first_lat = user_gps_log.iloc[0].lat
first_lng = user_gps_log.iloc[0].lng
all_distances = user_gps_log.apply(lambda x: distance((x.lat, x.lng), (first_lat, first_lng)).m, axis=1)

user_gps_log['all_distances'] = all_distances

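p = user_gps_log
i = 0
dist_thres = 200  # metres; the reset threshold described above

while i < len(p):
    j = i + 1
    token = 0  # set to 1 once a point beyond the threshold is found
    while j < len(p):
        dist = distance((p.iloc[i].lat, p.iloc[i].lng), (p.iloc[j].lat, p.iloc[j].lng)).m
        if dist > dist_thres:
            # do stuff
            i = j  # point j becomes the new reference point
            token = 1
            break
        j = j + 1
    if token != 1:
        # no remaining point exceeds the threshold, so stop
        break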

Edit / update

I tried an implementation with njit (I could not avoid iterating...):

@njit
def cumsum_distance(lat, lng, limit=200):
    running_distance = 0
    first = (lat[0], lng[0])
    for i in range(lat.shape[0]):
        dist = distance(first, (lat[i], lng[i])).m
        running_distance += dist
        if running_distance > limit:
            yield i, running_distance
            running_distance = 0

runnig_distances = cumsum_distance(user_gps_log.lat.values, user_gps_log.lng.values, 200)

It fails with this error:

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'distance': cannot determine Numba type of <class 'type'>

File "<ipython-input-194-7214618c7e64>", line 6:
def cumsum_distance(lat, lng, limit=200):
    <source elided>
    for i in range(lat.shape[0]):
        dist = distance(first, (lat[i], lng[i])).m
        ^

This is not usually a problem with Numba itself but instead often caused by
the use of unsupported features or an issue in resolving types.

Is this because I am using geopy's distance function? Do I need to register a "type", the same way as when using a UDAF in pyspark?
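A likely reading of the error: nopython mode only compiles code in which Numba can type every name, and geopy's `distance` is an ordinary Python class, so Numba has no type for it. Rather than registering a type, one workaround is to write the distance with plain `math` operations that Numba does support. The sketch below assumes a spherical haversine distance is an acceptable substitute for geopy's geodesic one. `haversine_m`, `reset_indices` and the `try`/`except` fallback are illustrative additions, not part of the original code.

```python
import math

try:
    import numpy as np
    from numba import njit
    as_array = np.asarray  # Numba works best on NumPy arrays
except ImportError:
    # Fallback so the sketch still runs (uncompiled) without Numba installed.
    def njit(func):
        return func
    as_array = list

EARTH_RADIUS_M = 6371000.0  # mean Earth radius (spherical approximation)


@njit
def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres; inputs are in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


@njit
def reset_indices(lat, lng, limit):
    """Row indices at which the distance from the reference point exceeds limit."""
    resets = []
    anchor_lat = lat[0]
    anchor_lng = lng[0]
    for i in range(len(lat)):
        d = haversine_m(anchor_lat, anchor_lng, lat[i], lng[i])
        if d > limit:
            resets.append(i)
            # Unlike the attempt above, move the reference point on each reset.
            anchor_lat = lat[i]
            anchor_lng = lng[i]
    return resets


lat = as_array([39.984198, 39.984611, 39.984252, 39.983916, 39.982688,
                39.981441, 39.980291, 39.979675, 39.979546, 39.979758])
lng = as_array([116.319322, 116.319822, 116.320826, 116.320980, 116.321225,
                116.321305, 116.321430, 116.321805, 116.322926, 116.324513])

reset_rows = reset_indices(lat, lng, 200.0)
print(list(reset_rows))
```

With a DataFrame, the inputs would be `user_gps_log.lat.values` and `user_gps_log.lng.values`. The loop itself remains, but it runs as compiled code rather than through `DataFrame.apply`.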

0 answers:

No answers yet