Question

I am using monotonically_increasing_id() to assign row numbers to a PySpark DataFrame with the following syntax:

df1 = df1.withColumn("idx", monotonically_increasing_id())

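Now df1 has 26,572,528 records, so I was expecting idx values from 0 to 26,572,527.

But when I select max(idx), its value is strangely huge: 335,008,054,165.

What is going on with this function? Is it reliable to use this function for merging with another dataset that has a similar number of records?

I have about 300 DataFrames that I want to combine into a single DataFrame. One DataFrame contains the IDs, and the others contain different records corresponding to those rows.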

Answer 1

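From the documentation:

A column that generates monotonically increasing 64-bit integers.

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the DataFrame has fewer than 1 billion partitions, and each partition has fewer than 8 billion records.

So it is not like an auto-increment ID in an RDB, and it is not reliable for merging.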
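To make the bit layout concrete, here is a minimal pure-Python sketch of the scheme the documentation describes. The helper names (make_id, split_id, ids_for) are made up for illustration and are not part of Spark. It decodes the maximum value from the question and shows why two datasets with the same row count but different partition layouts would not get matching ids.

```python
RECORD_BITS = 33  # per the docs: low 33 bits hold the record number within a partition


def make_id(partition_id, record_number):
    """Hypothetical model of the documented layout: partition ID in the upper bits."""
    return (partition_id << RECORD_BITS) | record_number


def split_id(generated_id):
    """Split an id back into (partition_id, record_number)."""
    return generated_id >> RECORD_BITS, generated_id & ((1 << RECORD_BITS) - 1)


def ids_for(partition_sizes):
    """Ids for a dataset whose partitions hold the given numbers of rows."""
    return [make_id(p, r) for p, size in enumerate(partition_sizes) for r in range(size)]


# The "strangely huge" maximum from the question decodes to a small pair.
max_partition, max_record = split_id(335_008_054_165)
print(max_partition, max_record)  # 39 605077

# Same number of rows (3), different partition layouts, different ids.
ids_a = ids_for([2, 1])
ids_b = ids_for([1, 2])
print(ids_a)  # [0, 1, 8589934592]
print(ids_b)  # [0, 8589934592, 8589934593]
```

Under this model the question's maximum is simply record 605,077 of partition 39, and a join on idx between ids_a and ids_b would match only two of the three rows.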

If you need auto-increment behavior like in an RDB and your data is sortable, then you can use row_number:

df.createOrReplaceTempView('df')
# order by the column itself; by default Spark SQL parses a double-quoted name as a string literal
spark.sql('select row_number() over (order by some_column) as num, * from df')
+---+-----------+
|num|some_column|
+---+-----------+
|  1|   ....... |
|  2|   ....... |
|  3| ..........|
+---+-----------+

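If your data is not sortable and you don't mind using RDDs to create the indexes and then falling back to DataFrames, you can use rdd.zipWithIndex().

An example can be found here.

In short: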

# since you have a dataframe, use the rdd interface to create indexes with zipWithIndex()
df = df.rdd.zipWithIndex()
# return back to dataframe
df = df.toDF()

df.show()

# your data           | indexes
+---------------------+---+
|         _1          | _2|
+---------------------+---+
|[data col1,data col2]|  0|
|[data col1,data col2]|  1|
|[data col1,data col2]|  2|
+---------------------+---+

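You will probably need some more transformations after that to get your DataFrame into the shape you need. Note: this is not a very performant solution.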

Hope this helps. Good luck!

Edit: come to think of it, you can combine monotonically_increasing_id with row_number:

# create a monotonically increasing id
df = df.withColumn("idx", monotonically_increasing_id())
# then since the id is increasing but not consecutive, it means you can sort by it, so you can use the `row_number`
df.createOrReplaceTempView('df')
new_df = spark.sql('select row_number() over (order by idx) as num, * from df')

Not sure about the performance, though.

Answer 2

Using the API functions, you can simply do the following:

from pyspark.sql.window import Window as W
from pyspark.sql import functions as F
df1 = df1.withColumn("idx", F.monotonically_increasing_id())
windowSpec = W.orderBy("idx")
df1.withColumn("idx", F.row_number().over(windowSpec)).show()

I hope the answer is helpful.

Answer 3

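I found @mkaran's solution useful, but in my case there was no column to order by when using the window function. I wanted to keep the DataFrame's existing row order as the index (what you would see in a pandas DataFrame). So the solution in the edit section came in handy. Since it is a good solution (if performance is not a concern), I wanted to share it as a separate answer.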

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Add an increasing (but not consecutive) id column
df_index = df.withColumn("idx", F.monotonically_increasing_id())

# Create the window specification
w = Window.orderBy("idx")

# Use row_number with the window specification
df_index = df_index.withColumn("index", F.row_number().over(w))

# Drop the temporary increasing id column
df_index = df_index.drop("idx")

df is your original DataFrame and df_index is the new DataFrame.

Answer 4

To merge DataFrames of the same size, use zip on the RDDs:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

# in PySpark, builder is an attribute, not a method
spark = SparkSession.builder.master("local").getOrCreate()
df1 = spark.sparkContext.parallelize([(1, "a"),(2, "b"),(3, "c")]).toDF(["id", "name"])
df2 = spark.sparkContext.parallelize([(7, "x"),(8, "y"),(9, "z")]).toDF(["age", "address"])

schema = StructType(df1.schema.fields + df2.schema.fields)
df1df2 = df1.rdd.zip(df2.rdd).map(lambda x: x[0]+x[1])
spark.createDataFrame(df1df2, schema).show()

But note the following from the method's help:

    Assumes that the two RDDs have the same number of partitions and the same
    number of elements in each partition (e.g. one was made through
    a map on the other).

Answer 5

Building on @mkaran's answer,

df.coalesce(1).withColumn("idx", monotonically_increasing_id())

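Using .coalesce(1) puts the DataFrame in a single partition, so the index column is both monotonically increasing and consecutive. Make sure the data is of a reasonable size to fit in one partition, which avoids potential problems later. It is worth noting that I had sorted my DataFrame in ascending order beforehand.

Here is a preview comparison of what it looked like for me with and without coalesce, where I had a summary DataFrame of 50 rows:

df.coalesce(1).withColumn("No", monotonically_increasing_id()).show(60)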

startTimes	endTimes	No
2019-11-01 05:39:50	2019-11-01 06:12:50	0
2019-11-01 06:23:10	2019-11-01 06:23:50	1
2019-11-01 06:26:49	2019-11-01 06:46:29	2
2019-11-01 07:00:29	2019-11-01 07:04:09	3
2019-11-01 15:24:29	2019-11-01 16:04:59	4
2019-11-01 16:23:38	2019-11-01 17:27:58	5
2019-11-01 17:32:18	2019-11-01 17:47:58	6
2019-11-01 17:54:18	2019-11-01 18:00:00	7
2019-11-02 04:42:40	2019-11-02 04:49:20	8
2019-11-02 05:11:40	2019-11-02 05:22:00	9

df.withColumn("runNo", monotonically_increasing_id()).show(60)

startTimes	endTimes	No
2019-11-01 05:39:50	2019-11-01 06:12:50	0
2019-11-01 06:23:10	2019-11-01 06:23:50	8589934592
2019-11-01 06:26:49	2019-11-01 06:46:29	17179869184
2019-11-01 07:00:29	2019-11-01 07:04:09	25769803776
2019-11-01 15:24:29	2019-11-01 16:04:59	34359738368
2019-11-01 16:23:38	2019-11-01 17:27:58	42949672960
2019-11-01 17:32:18	2019-11-01 17:47:58	51539607552
2019-11-01 17:54:18	2019-11-01 18:00:00	60129542144
2019-11-02 04:42:40	2019-11-02 04:49:20	68719476736
2019-11-02 05:11:40	2019-11-02 05:22:00	77309411328
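
The values in the second preview are exactly what the documented bit layout would give if each of the ten rows shown sat in its own partition. That placement is an inference from the numbers, not something stated above. A small pure-Python check, with hypothetical variable names:

```python
RECORD_BITS = 33  # documented layout: partition ID above the low 33 bits
ROWS = 10         # number of rows shown in the previews

# Assumed placement: row k is record 0 of partition k.
one_row_per_partition = [(k << RECORD_BITS) | 0 for k in range(ROWS)]

# After coalesce(1): every row is in partition 0, so the id is the record number.
single_partition = [(0 << RECORD_BITS) | r for r in range(ROWS)]

print(one_row_per_partition[:3])  # [0, 8589934592, 17179869184]
print(single_partition)           # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```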

Using monotonically_increasing_id() to assign row numbers to a PySpark DataFrame