I want to create an ID column for my PySpark dataframe. I have a column A with duplicate values, and I want to take all the distinct values and assign an ID to each one.
I have:
+----+
| A|
+----+
|1001|
|1002|
|1003|
|1001|
|1003|
|1004|
|1001|
+----+
I want:
+----+----+
| A| new|
+----+----+
|1002| 1|
|1001| 2|
|1004| 3|
|1003| 4|
+----+----+
Here is my code:
# Libraries
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import monotonically_increasing_id

sc = SparkContext()
sqlContext = SQLContext(sc)

# Create a pyspark dataframe from a pandas dataframe
df = pd.DataFrame()
df["A"] = [1001, 1002, 1003, 1001, 1003, 1004, 1001]
df = sqlContext.createDataFrame(df)

# Take the distinct values of A and attach an ID column
IDs = df.select("A").distinct()
IDs = IDs.withColumn("new", monotonically_increasing_id())
IDs.show()
I get:
+----+-------------+
| A| new|
+----+-------------+
|1002| 188978561024|
|1001|1065151889408|
|1004|1511828488192|
|1003|1623497637888|
+----+-------------+
But it should be:
+----+----+
| A| new|
+----+----+
|1002| 1|
|1001| 2|
|1004| 3|
|1003| 4|
+----+----+
Why do I get that result?
Answer 0 (score: 0)
monotonically_increasing_id is guaranteed to be monotonically increasing and unique, but not consecutive: the generated ID packs the partition index into the upper 31 bits and a per-partition row counter into the lower 33 bits, which is why you get large values such as 188978561024 (partition 22 shifted left by 33 bits: 22 << 33 = 188978561024). You can use the function row_number() instead of monotonically_increasing_id to get the result you want.
>>> from pyspark.sql.window import Window
>>> from pyspark.sql.functions import row_number, lit
>>> # lit(1) puts every row in one window partition so the numbering is global
>>> w = Window.partitionBy(lit(1)).orderBy("A")
>>> df.show()
+----+
| A|
+----+
|1001|
|1003|
|1001|
|1004|
|1005|
|1003|
|1005|
|1003|
|1006|
|1001|
|1002|
+----+
>>> df1 = df.select("A").distinct().withColumn("ID", row_number().over(w))
>>> df1.show()
+----+---+
| A| ID|
+----+---+
|1001| 1|
|1002| 2|
|1003| 3|
|1004| 4|
|1005| 5|
|1006| 6|
+----+---+
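If the goal is to attach the new ID to every row of the original dataframe rather than just to the distinct values, a join on A completes the picture. A minimal sketch continuing the same session (df and df1 as above); the broadcast() hint is an optional assumption that the ID mapping is small enough to ship to every executor:

>>> from pyspark.sql.functions import broadcast
>>> # df1 holds one row per distinct value of A, so broadcasting it is cheap
>>> df_with_id = df.join(broadcast(df1), on="A", how="left")

Also note that Window.partitionBy(lit(1)) moves all distinct values into a single partition in order to number them consecutively; that is fine for a small distinct set like this one, but it will not scale to a very large number of distinct keys.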