Question

Suppose I have a PySpark DataFrame (DF):

-----------------------------
record_id | foo | bar
-----------------------------
1 | random text | random text
2 | random text | random text
3 | random text | random text
1 | random text | random text
2 | random text | random text
-----------------------------

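My end goal is to write these rows to MySQL with .write.jdbc(), which I have been doing successfully. But now, before writing, I want to add a new column unique based on the uniqueness of the record_id column.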

I've made a little progress identifying the unique record_id values with something like:

df.select('record_id').distinct().rdd.map(lambda r: r[0])

But unlike Pandas DataFrames, I don't believe this gives me an index I can reuse; it appears to be just the values. I'm still fairly new to Spark / PySpark.

I'm trying to figure out whether the following workflow makes sense:

Identify the rows with a distinct record_id, and write them to MySQL
Then, identify the remaining rows, and write them to MySQL

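Or would it be possible to alter the original DF, adding a new column unique via some chained commands? Something like the below, which I could then write to MySQL wholesale: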

----------------------------------
record_id | foo | bar | unique 
----------------------------------
1 | random text | random text | 0
2 | random text | random text | 0
3 | random text | random text | 1 # 1 stands for boolean True
1 | random text | random text | 0
2 | random text | random text | 0
----------------------------------

Any suggestions or advice would be much appreciated!

Answer 1

You can count the number of rows within a window partitioned by record_id; if a record_id has only one row, mark it as unique:

from pyspark.sql.window import Window
import pyspark.sql.functions as F

df.withColumn("unique", (F.count("record_id").over(Window.partitionBy("record_id")) == 1).cast('integer')).show()
+---------+-----------+-----------+------+
|record_id|        foo|        bar|unique|
+---------+-----------+-----------+------+
|        3|random text|random text|     1|
|        1|random text|random text|     0|
|        1|random text|random text|     0|
|        2|random text|random text|     0|
|        2|random text|random text|     0|
+---------+-----------+-----------+------+

PySpark DataFrame: select rows with distinct values and rows with non-distinct values
