PySpark - Assign a unique ID to each group

Asked: 2018-12-18 10:17:47

Tags: python pandas apache-spark pyspark

I want to assign a unique ID number to each group in a column (starting from 0 or 1 and incrementing by 1 for each new group) using PySpark.

I previously did this with Python and pandas using the following command:

# grouper.group_info[0] holds the 0-based group code for each row
# (an internal pandas API)
df['id_num'] = (df
                .groupby('column_name')
                .grouper
                .group_info[0])
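
For reference, the same result is available through the public pandas API, GroupBy.ngroup(), which returns 0-based group numbers and avoids the internal grouper attribute:

# Public-API equivalent: 0-based group number per row
df['id_num'] = df.groupby('column_name').ngroup()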

A toy example of the input and desired output:

Input:

+------+
|object|
+------+
|apple |
|orange|
|pear  |
|berry |
|apple |
|pear  |
|berry |
+------+

Output:

+------+--+
|object|id|
+------+--+
|apple |1 |
|orange|2 |
|pear  |3 |
|berry |4 |
|apple |1 |
|pear  |3 |
|berry |4 |
+------+--+

2 Answers:

Answer 0 (score: 2)

I'm not sure whether order matters. If it doesn't, you can use the dense_rank window function in this case:

>>> from pyspark.sql.window import Window
>>> import pyspark.sql.functions as F
>>> 
>>> df.show()
+------+
|object|
+------+
| apple|
|orange|
|  pear|
| berry|
| apple|
|  pear|
| berry|
+------+
>>> 
>>> df.withColumn("id", F.dense_rank().over(Window.orderBy(df.object))).show()
+------+---+
|object| id|
+------+---+
| apple|  1|
| apple|  1|
| berry|  2|
| berry|  2|
|orange|  3|
|  pear|  4|
|  pear|  4|
+------+---+
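
One caveat: Window.orderBy without a partitionBy pulls every row into a single partition, which can be slow on large data. If that matters, here is a minimal sketch of a join-based alternative (assuming an active SparkSession, so zipWithIndex and toDF are available) that builds a small lookup of distinct values and joins it back:

# Build a lookup of distinct values with 1-based ids, then join it back.
# Ids are unique per value, but which value gets which id is arbitrary.
lookup = (df.select("object").distinct()
            .rdd.zipWithIndex()
            .map(lambda pair: (pair[0]["object"], pair[1] + 1))
            .toDF(["object", "id"]))

df.join(lookup, on="object", how="left").show()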

Answer 1 (score: 0)

from pyspark.sql.functions import col, create_map, lit
from itertools import chain

values = [('apple',), ('orange',), ('pear',), ('berry',), ('apple',), ('pear',), ('berry',)]
df = sqlContext.createDataFrame(values, ['object'])

# Collect the distinct values and build a dictionary mapping each one to a unique index.
distinct_list = list(df.distinct().select('object').toPandas()['object'])
dict_with_index = {distinct_list[i]: i + 1 for i in range(len(distinct_list))}

# Turn the dictionary into a map expression and apply it to the column.
mapping_expr = create_map([lit(x) for x in chain(*dict_with_index.items())])
df = df.withColumn("id", mapping_expr.getItem(col("object")))
df.show()
+------+---+
|object| id|
+------+---+
| apple|  2|
|orange|  1|
|  pear|  3|
| berry|  4|
| apple|  2|
|  pear|  3|
| berry|  4|
+------+---+
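
Note that this approach collects all distinct values to the driver via toPandas, so it only suits columns of modest cardinality. If 0-based, frequency-ordered indices are acceptable, pyspark.ml.feature.StringIndexer does the same bookkeeping for you; a minimal sketch:

from pyspark.ml.feature import StringIndexer

# StringIndexer assigns 0-based double-typed indices, ordered by label
# frequency (the most frequent value gets 0.0).
indexer = StringIndexer(inputCol="object", outputCol="id")
indexer.fit(df).transform(df).show()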