I have a dataframe with a structure similar to this:
+----+-----+-------+------+------+------+
| cod| name|sum_vol|  date|   lat|   lon|
+----+-----+-------+------+------+------+
|aggc|23124|     37|201610|-15.42|-32.11|
|aggc|23124|     19|201611|-15.42|-32.11|
| abc|  231|     22|201610|-26.42|-43.11|
| abc|  231|     22|201611|-26.42|-43.11|
| ttx|  231|     10|201610|-22.42|-46.11|
| ttx|  231|     10|201611|-22.42|-46.11|
| tty|  231|     25|201610|-25.42|-42.11|
| tty|  231|     45|201611|-25.42|-42.11|
|xptx|  124|     62|201611|-26.43|-43.21|
|xptx|  124|    260|201610|-26.43|-43.21|
|xptx|23124|     50|201610|-26.43|-43.21|
|xptx|23124|     50|201611|-26.43|-43.21|
+----+-----+-------+------+------+------+
Now I want to aggregate the lat and lon values, but using my own function:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def get_centroid(lat, lon):
    # ...do whatever I need here
    return t_lat, t_lon

get_c = udf(lambda x, y: get_centroid(x, y), FloatType())
gg = df.groupby('cod', 'name').agg(get_c('lat', 'lon'))
But I get the following error:
u"expression 'pythonUDF' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;"
Is there a way to get the elements of each group and operate on them without using a UDAF? Something similar to this pandas call:
df.groupby(['cod','name'])[['lat', 'lon']].apply(f).to_frame().reset_index()
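
For context, the error comes from the fact that a plain udf is applied row by row and is not an aggregate function, so it cannot be passed to agg directly. One possible direction, sketched here rather than offered as a definitive answer, is to gather each group's values with the built-in collect_list aggregate and then apply an ordinary UDF to the collected list. Several things below are assumptions and not part of the original post: the arithmetic-mean body of get_centroid is only a placeholder for whatever logic is actually needed, its signature is changed to take a list of (lat, lon) pairs, centroid_per_group is a hypothetical helper name, and lat and lon are assumed to be numeric columns.

```python
def get_centroid(points):
    """Placeholder logic (assumption): arithmetic mean of (lat, lon) pairs."""
    n = len(points)
    lat = sum(float(p[0]) for p in points) / n
    lon = sum(float(p[1]) for p in points) / n
    return (lat, lon)


def centroid_per_group(df):
    """Hypothetical helper: collect each group's points, then apply a UDF."""
    # Imported here so get_centroid stays usable without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType, StructField, StructType

    # The UDF returns two values, so its return type is a struct, not a float.
    schema = StructType([
        StructField("lat", DoubleType()),
        StructField("lon", DoubleType()),
    ])
    get_c = F.udf(get_centroid, schema)

    return (
        df.groupBy("cod", "name")
          # collect_list is a real aggregate; the struct keeps lat and lon paired.
          .agg(F.collect_list(F.struct("lat", "lon")).alias("points"))
          .withColumn("centroid", get_c("points"))
          .select("cod", "name", "centroid.lat", "centroid.lon")
    )
```

With the dataframe above this would be called as gg = centroid_per_group(df). Note that collect_list pulls every value of a group into a single row, so this sketch is only reasonable when individual groups are small enough to fit in memory on one executor.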