Applying different aggregate functions to PySpark groups

Asked: 2016-11-24 18:01:32

Tags: pyspark geospatial

I have a dataframe with a structure similar to:
+----+-----+-------+------+------+------+
| cod| name|sum_vol|  date|   lat|   lon|
+----+-----+-------+------+------+------+
|aggc|23124|     37|201610|-15.42|-32.11|
|aggc|23124|     19|201611|-15.42|-32.11|
| abc|  231|     22|201610|-26.42|-43.11|
| abc|  231|     22|201611|-26.42|-43.11|
| ttx|  231|     10|201610|-22.42|-46.11|
| ttx|  231|     10|201611|-22.42|-46.11|
| tty|  231|     25|201610|-25.42|-42.11|
| tty|  231|     45|201611|-25.42|-42.11|
|xptx|  124|     62|201611|-26.43|-43.21|
|xptx|  124|    260|201610|-26.43|-43.21|
|xptx|23124|     50|201610|-26.43|-43.21|
|xptx|23124|     50|201611|-26.43|-43.21|
+----+-----+-------+------+------+------+
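
For reference, a minimal sketch that rebuilds this sample frame; the SparkSession setup and the column types (strings for cod/name, ints for sum_vol/date, floats for lat/lon) are assumptions read off the output above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed session setup

# Rows copied from the table above; types are guesses from the printed output.
df = spark.createDataFrame(
    [('aggc', '23124', 37, 201610, -15.42, -32.11),
     ('aggc', '23124', 19, 201611, -15.42, -32.11),
     ('abc', '231', 22, 201610, -26.42, -43.11),
     ('abc', '231', 22, 201611, -26.42, -43.11),
     ('ttx', '231', 10, 201610, -22.42, -46.11),
     ('ttx', '231', 10, 201611, -22.42, -46.11),
     ('tty', '231', 25, 201610, -25.42, -42.11),
     ('tty', '231', 45, 201611, -25.42, -42.11),
     ('xptx', '124', 62, 201611, -26.43, -43.21),
     ('xptx', '124', 260, 201610, -26.43, -43.21),
     ('xptx', '23124', 50, 201610, -26.43, -43.21),
     ('xptx', '23124', 50, 201611, -26.43, -43.21)],
    ['cod', 'name', 'sum_vol', 'date', 'lat', 'lon'])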

Now I want to aggregate the lat and lon values, but with my own function:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def get_centroid(lat, lon):
    # ...do whatever I need here
    return t_lat, t_lon

get_c = udf(lambda x, y: get_centroid(x, y), FloatType())

gg = df.groupby('cod', 'name').agg(get_c('lat', 'lon'))

But I get the following error:

u"expression 'pythonUDF' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;"

Is there a way to get the elements of a group and operate on them without using a UDAF? Something like this pandas code:
df.groupby(['cod','name'])[['lat', 'lon']].apply(f).to_frame().reset_index()
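
One workaround in that spirit, which avoids writing a UDAF, is to collect each group's coordinates into arrays with collect_list and then run an ordinary udf over the arrays. A minimal sketch, assuming for illustration that get_centroid simply averages the points; the centroid helper below is hypothetical, not the asker's actual logic:

from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def centroid(values):
    # Plain average along one coordinate axis; stands in for the real logic.
    return float(sum(values)) / len(values)

centroid_udf = udf(centroid, FloatType())

gg = (df.groupby('cod', 'name')
        .agg(F.collect_list('lat').alias('lats'),
             F.collect_list('lon').alias('lons'))
        .withColumn('t_lat', centroid_udf('lats'))
        .withColumn('t_lon', centroid_udf('lons'))
        .drop('lats', 'lons'))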

0 Answers