Performing pyproj.Geod calculations on a very large pandas DataFrame

Time: 2014-04-25 17:56:06

Tags: python pandas geospatial

Background

I have a pandas DataFrame with a little over 200k rows of data.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 212812 entries, 0 to 212811
Data columns (total 10 columns):
date         212812  non-null values
animal_id    212812  non-null values
lons         212812  non-null values
lats         212812  non-null values
depth        212812  non-null values
prey1        212812  non-null values
prey2        212812  non-null values
prey3        212812  non-null values
dist         212812  non-null values
sog          212812  non-null values
dtypes: float64(9), int64(1), object(1)

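For each date, there are 1000 individuals, each with a lon/lat position.

I want to calculate the daily change in distance for each individual. I did this successfully with pyproj.Geod.inv for 100 individuals, but increasing the population has slowed things down considerably.

Question

Is there an efficient way to perform calculations on a pandas DataFrame using an external class method such as pyproj.Geod.inv?

Example routine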

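    import numpy as np
    import pyproj

    # g was not defined in the original snippet; assumed to be a WGS84 Geod
    g = pyproj.Geod(ellps='WGS84')
    ids = np.unique(data['animal_id'])

    for animal in ids:
        id_idx = data['animal_id'] == animal
        # .values gives positional access; dates[i] on the filtered Series
        # would be a label lookup and fail for most animals
        dates = data.loc[id_idx, 'date'].values
        for i in range(len(dates) - 1):
            # Use the loop variable, not the builtin id
            idx1 = id_idx & (data['date'] == dates[i])
            idx2 = id_idx & (data['date'] == dates[i + 1])
            lon1 = data.loc[idx1, 'lons'].values
            lat1 = data.loc[idx1, 'lats'].values
            lon2 = data.loc[idx2, 'lons'].values
            lat2 = data.loc[idx2, 'lats'].values
            fwd_az, bck_az, dist = g.inv(lon1, lat1, lon2, lat2)
            # .loc avoids chained assignment, which may write to a copy
            data.loc[idx2, 'dist'] = dist
            data.loc[idx2, 'sog'] = dist / 24.  # dist / time (hours)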

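1 Answer:

Answer 0: (score: 0)

I came up with a solution, but I would really appreciate suggestions on this approach, or a more efficient way to implement it.

I first used the pandas shift method to add shifted lon/lat columns (inspired by this SO question), so that each calculation can be performed on a single row.

I then used the pandas apply method (as was suggested here) to perform the pyproj.Geod.inv calculation, looping over the slice of the DataFrame that belongs to each individual in the population.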
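As a minimal sketch of the shift step, assuming the rows are ordered by date within each animal: grouping by animal_id before shifting keeps one individual's previous position from ending up in another individual's row. The toy data and the prev_lons / prev_lats column names are made up for illustration and are not from the original post.

```python
import pandas as pd

# Hypothetical toy data: two animals, three daily positions each,
# interleaved by date as in the question's DataFrame
df = pd.DataFrame({
    'date': ['d1', 'd1', 'd2', 'd2', 'd3', 'd3'],
    'animal_id': [1, 2, 1, 2, 1, 2],
    'lons': [10.0, 20.0, 10.1, 20.1, 10.2, 20.2],
    'lats': [50.0, 60.0, 50.1, 60.1, 50.2, 60.2],
})

# Shift within each animal so every row sees that animal's previous position
grouped = df.groupby('animal_id')
df['prev_lons'] = grouped['lons'].shift(1)
df['prev_lats'] = grouped['lats'].shift(1)

# The first record of each animal has no predecessor; reuse its own position
df['prev_lons'] = df['prev_lons'].fillna(df['lons'])
df['prev_lats'] = df['prev_lats'].fillna(df['lats'])
```

With both the current and the previous position on the same row, the distance for a row no longer depends on any other row.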

def calc_distspd(df):
    '''Broadcast pyproj distance calculation over pandas dataframe'''

    import pyproj
    import numpy as np

    def calcdist(x):
        '''Pandas broadcast function for pyproj distance calculations'''
        return g.inv(x['lons+1'], x['lats+1'], x['lons'], x['lats'])[2]

    # Define Earth ellipsoid for dist calculations
    g = pyproj.Geod(ellps='WGS84')

    # Create array of zeros to initialize new columns
    fill_data = np.zeros(df['date'].shape)

    # Create new columns for calculated values
    df['dist'] = fill_data
    df['sog']  = fill_data
    df['lons+1'] = fill_data
    df['lats+1'] = fill_data

    # Get list of unique animal_ids
    animal_ids = np.unique(df.animal_id.values)

    # Perform function broadcast for each individual
    for animal_id in animal_ids:
        idx = df['animal_id']==animal_id

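        # Add shifted position columns for dist calculations. Shift only this
        # individual's rows (assumed sorted by date) so that the origin is
        # never another animal's position
        df.loc[idx, 'lons+1'] = df.loc[idx, 'lons'].shift(1) # lon+1 = origin position
        df.loc[idx, 'lats+1'] = df.loc[idx, 'lats'].shift(1) # lat+1 = origin position

        # Copy 1st position over shifted column nans to prevent error
        idx2 = (idx) & (np.isnan(df['lons+1']))
        df.loc[idx2, 'lons+1'] = df.loc[idx2, 'lons']
        df.loc[idx2, 'lats+1'] = df.loc[idx2, 'lats']

        # .loc avoids chained assignment, which may write to a copy
        df.loc[idx, 'dist'] = df[idx].apply(calcdist, axis=1)
        df.loc[idx, 'sog']  = df.loc[idx, 'dist']/24. # Calc hourly speed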

    # Remove shifted position columns from df
    del df['lons+1']
    del df['lats+1']

    return df
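
On the request for a more efficient approach: pyproj.Geod.inv accepts array inputs, so the row-wise apply and the per-animal loop can probably be replaced by a single call. The following is only a sketch under stated assumptions, not a tested drop-in replacement: it assumes one position per animal per date and a unique DataFrame index, and the function name calc_distspd_vectorized and the hours_per_step parameter are hypothetical names introduced here.

```python
import pandas as pd
import pyproj


def calc_distspd_vectorized(df, hours_per_step=24.0):
    """Sketch: per-animal step distance (m) and speed (m/h) without apply.

    Assumes one position per animal per date and a unique index.
    """
    g = pyproj.Geod(ellps='WGS84')

    # Sort so that shift(1) within each animal means "previous date"
    out = df.sort_values(['animal_id', 'date']).copy()

    # Previous position of the same animal; first record falls back to itself
    grouped = out.groupby('animal_id')
    prev_lons = grouped['lons'].shift(1).fillna(out['lons'])
    prev_lats = grouped['lats'].shift(1).fillna(out['lats'])

    # One array call to Geod.inv covers every row at once
    _, _, dist = g.inv(prev_lons.values, prev_lats.values,
                       out['lons'].values, out['lats'].values)

    out['dist'] = dist
    out['sog'] = out['dist'] / hours_per_step

    # Restore the caller's row order (relies on the index being unique)
    return out.loc[df.index]


# Hypothetical toy data: animal 1 moves one degree north, animal 2 stays put
demo = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-01', '2014-01-01',
                            '2014-01-02', '2014-01-02']),
    'animal_id': [1, 2, 1, 2],
    'lons': [0.0, 10.0, 0.0, 10.0],
    'lats': [0.0, 0.0, 1.0, 0.0],
})
result = calc_distspd_vectorized(demo)
```

The design choice here is to pay for one sort and one groupby instead of building a boolean mask per animal, which is where most of the time in the loop-based version is likely to go. The first record of each animal gets a distance of zero, matching the behaviour of the function above.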