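I want to assign a unique ID number to each group in a set of groups (starting from 0 or 1 and incrementing by 1 for each group) using PySpark.
Previously I did this in Python with pandas using the following command: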
df['id_num'] = (df
.groupby('column_name')
.grouper
.group_info[0])
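The `.grouper.group_info` attribute is internal to pandas, so the same numbering is often written with public calls instead. The following is only a sketch, assuming a reasonably recent pandas and a toy column named `object`: `groupby(...).ngroup()` numbers groups in sorted key order, while `pd.factorize` numbers them in order of first appearance.

```python
import pandas as pd

df = pd.DataFrame(
    {"object": ["apple", "orange", "pear", "berry", "apple", "pear", "berry"]}
)

# ngroup() numbers groups from 0 in sorted key order (groupby sorts keys by default).
df["id_sorted"] = df.groupby("object").ngroup() + 1

# factorize() numbers values from 0 in order of first appearance.
codes, uniques = pd.factorize(df["object"])
df["id_first_seen"] = codes + 1

print(df)
```

Which of the two is the better fit depends on whether the ids should follow the sorted keys or the order in which the values first occur.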
A toy example of the input and the desired output:
Input:
+------+
|object|
+------+
|apple |
|orange|
|pear |
|berry |
|apple |
|pear |
|berry |
+------+
Output:
+------+--+
|object|id|
+------+--+
|apple |1 |
|orange|2 |
|pear |3 |
|berry |4 |
|apple |1 |
|pear |3 |
|berry |4 |
+------+--+
Answer 0 (score: 2)
I am not sure whether the order matters. If it does not, you can use the dense_rank window function in this case:
>>> from pyspark.sql.window import Window
>>> import pyspark.sql.functions as F
>>>
>>> df.show()
+------+
|object|
+------+
| apple|
|orange|
| pear|
| berry|
| apple|
| pear|
| berry|
+------+
>>>
>>> df.withColumn("id", F.dense_rank().over(Window.orderBy(df.object))).show()
+------+---+
|object| id|
+------+---+
| apple| 1|
| apple| 1|
| berry| 2|
| berry| 2|
|orange| 3|
| pear| 4|
| pear| 4|
+------+---+
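The ids above follow alphabetical order rather than the order shown in the question, because the window is ordered by the `object` column. As a rough illustration of what a dense rank over an ordered column computes, here is a plain-Python sketch; the helper `dense_rank` is hypothetical and is not part of PySpark.

```python
def dense_rank(values):
    """Return a 1-based rank with no gaps for each value, by sorted order."""
    # Each distinct value gets the position it holds among the sorted distinct values.
    rank_of = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    return [rank_of[v] for v in values]


objects = ["apple", "orange", "pear", "berry", "apple", "pear", "berry"]
ids = dense_rank(objects)
print(list(zip(objects, ids)))
```

Equal values always share an id, and the ids stay consecutive, which is the property the question asks for when the ordering itself does not matter.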
Answer 1 (score: 0)
from pyspark.sql.functions import col, create_map, lit
from itertools import chain
values = [('apple',),('orange',),('pear',),('berry',),('apple',),('pear',),('berry',)]
df = sqlContext.createDataFrame(values,['object'])
#Creating a column of distinct elements and converting them into dictionary with unique indexes.
df1 = df.distinct()
distinct_list = list(df1.select('object').toPandas()['object'])
dict_with_index = {distinct_list[i]:i+1 for i in range(len(distinct_list))}
#Applying the mapping of dictionary.
mapping_expr = create_map([lit(x) for x in chain(*dict_with_index.items())])
df=df.withColumn("id", mapping_expr.getItem(col("object")))
df.show()
+------+---+
|object| id|
+------+---+
| apple| 2|
|orange| 1|
| pear| 3|
| berry| 4|
| apple| 2|
| pear| 3|
| berry| 4|
+------+---+
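In the output above, apple gets 2 and orange gets 1 because the ids follow whatever order `distinct()` happens to return, which Spark does not guarantee. If the ids should follow first appearance instead, the dictionary could be built along these lines. This is a plain-Python sketch, assuming the values are already collected on the driver in the desired order; the helper `first_seen_ids` is hypothetical.

```python
def first_seen_ids(values):
    """Map each distinct value to a 1-based id in order of first appearance."""
    # dict.fromkeys keeps insertion order (Python 3.7+), so duplicates collapse
    # while the first-seen order is preserved.
    return {v: i for i, v in enumerate(dict.fromkeys(values), start=1)}


objects = ["apple", "orange", "pear", "berry", "apple", "pear", "berry"]
dict_with_index = first_seen_ids(objects)
print(dict_with_index)
```

The resulting dictionary has the same shape as `dict_with_index` in the answer, so it could be passed to the same `create_map` expression.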