Background:
I have a pandas DataFrame with roughly 200k+ rows of data.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 212812 entries, 0 to 212811
Data columns (total 10 columns):
date 212812 non-null values
animal_id 212812 non-null values
lons 212812 non-null values
lats 212812 non-null values
depth 212812 non-null values
prey1 212812 non-null values
prey2 212812 non-null values
prey3 212812 non-null values
dist 212812 non-null values
sog 212812 non-null values
dtypes: float64(9), int64(1), object(1)
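For each date, there are 1000 individuals with lon/lat positions.
I want to calculate each individual's daily change in distance. I did this successfully for 100 individuals using pyproj.Geod.inv, but the larger population has slowed things down considerably.
Question:
Is there an efficient way to perform calculations on a pandas DataFrame using an external class method such as pyproj.Geod.inv?
Example routine: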
    import numpy as np
    import pyproj

    g = pyproj.Geod(ellps='WGS84')  # ellipsoid used for the distance calculations

    ids = np.unique(data['animal_id'])
    for animal in ids:
        id_idx = data['animal_id'] == animal
        dates = data['date'][id_idx]
        for i in range(len(dates) - 1):
            # Rows for this animal on two consecutive dates
            idx1 = id_idx & (data['date'] == dates.iloc[i])
            idx2 = id_idx & (data['date'] == dates.iloc[i + 1])
            lon1 = data['lons'][idx1].values
            lat1 = data['lats'][idx1].values
            lon2 = data['lons'][idx2].values
            lat2 = data['lats'][idx2].values
            fwd_az, bck_az, dist = g.inv(lon1, lat1, lon2, lat2)
            data.loc[idx2, 'dist'] = dist
            data.loc[idx2, 'sog'] = dist / 24.  # dist/time (hours)
Answer 0 (score: 0)
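I came up with a solution, but I would really appreciate suggestions on this approach, or a more efficient way to implement it.
I first used the pandas shift method to add shifted lon/lat columns (inspired by this SO question), so that I could perform the calculation on a single row.
Then I used the pandas apply method (as was suggested here) to carry out the pyproj.Geod.inv calculation, looping over the slice of the pandas DataFrame that belongs to each individual in the population.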
    def calc_distspd(df):
        '''Broadcast pyproj distance calculation over pandas dataframe'''
        import pyproj
        import numpy as np

        # Define Earth ellipsoid for dist calculations
        g = pyproj.Geod(ellps='WGS84')

        def calcdist(x):
            '''Pandas broadcast function for pyproj distance calculations'''
            return g.inv(x['lons+1'], x['lats+1'], x['lons'], x['lats'])[2]

        # Create array of zeros to initialize new columns
        fill_data = np.zeros(df['date'].shape)
        # Create new columns for calculated values
        df['dist'] = fill_data
        df['sog'] = fill_data
        df['lons+1'] = fill_data
        df['lats+1'] = fill_data
        # Get list of unique animal_ids
        animal_ids = np.unique(df.animal_id.values)
        # Perform function broadcast for each individual
        for animal_id in animal_ids:
            idx = df['animal_id'] == animal_id
            # Add shifted position columns for dist calculations, shifting
            # within this individual's rows only (rows assumed in date order)
            df.loc[idx, 'lons+1'] = df.loc[idx, 'lons'].shift(1)  # lon+1 = origin position
            df.loc[idx, 'lats+1'] = df.loc[idx, 'lats'].shift(1)  # lat+1 = origin position
            # Copy 1st position over shifted column nans to prevent error
            idx2 = idx & np.isnan(df['lons+1'])
            df.loc[idx2, 'lons+1'] = df.loc[idx2, 'lons']
            df.loc[idx2, 'lats+1'] = df.loc[idx2, 'lats']
            df.loc[idx, 'dist'] = df[idx].apply(calcdist, axis=1)
            df.loc[idx, 'sog'] = df.loc[idx, 'dist'] / 24.  # Calc hourly speed
        # Remove shifted position columns from df
        del df['lons+1']
        del df['lats+1']
        return df
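
One possible further speed-up, offered as a sketch rather than a benchmarked drop-in: since pyproj.Geod.inv accepts arrays, the per-individual loop and the row-wise apply could be replaced with a single groupby shift followed by one array call. The function name calc_distspd_vectorized and its geod parameter are hypothetical (not part of the original post). The sketch assumes that sorting the date column puts each individual's fixes in chronological order and that there is one fix per individual per day.

```python
import numpy as np
import pandas as pd


def calc_distspd_vectorized(df, geod=None):
    """Hypothetical vectorized variant: one Geod.inv call for all rows."""
    if geod is None:
        import pyproj  # imported lazily, as in the function above
        geod = pyproj.Geod(ellps='WGS84')

    # Sort a copy so that each individual's rows are in date order
    out = df.sort_values(['animal_id', 'date']).copy()

    # Previous position of the same individual (NaN on each first row)
    prev = out.groupby('animal_id')[['lons', 'lats']].shift(1)
    # First row per individual: reuse its own position, so dist = 0
    prev_lon = prev['lons'].fillna(out['lons']).to_numpy(dtype=float)
    prev_lat = prev['lats'].fillna(out['lats']).to_numpy(dtype=float)

    # Geod.inv takes arrays and returns (forward az, back az, distance)
    _, _, dist = geod.inv(prev_lon, prev_lat,
                          out['lons'].to_numpy(dtype=float),
                          out['lats'].to_numpy(dtype=float))
    out['dist'] = dist
    out['sog'] = out['dist'] / 24.  # distance per hour, assuming daily fixes
    return out
```

The result is a sorted copy that keeps the original index, so calling .sort_index() on it restores the original row order. Passing a geod object explicitly is only a convenience for reusing one ellipsoid definition across calls.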