PySpark: identify two rows with the same value in a column... and keep only one

Date: 2018-10-29 19:37:32

Tags: python filter pyspark duplicates

I have the following sample dataset.

ID      Date
213412  2008-10-26T06:04:00.000Z
213412  2018-10-26T05:42:00.000Z
393859  2018-10-26T09:17:00.000Z

Two of the rows above share the same ID value. I only want to keep one of the two rows for ID 213412... it does not matter which one I keep.

I know how to do this in Pandas with Python, but not how to do it in PySpark.
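(For comparison, a minimal Pandas sketch of the same deduplication; the DataFrame construction below is just illustrative of the sample data above:)

import pandas as pd

# Sample data mirroring the question
df = pd.DataFrame({
    'ID':   ['213412', '213412', '393859'],
    'Date': ['2008-10-26T06:04:00.000Z',
             '2018-10-26T05:42:00.000Z',
             '2018-10-26T09:17:00.000Z'],
})

# Keep the first occurrence of each ID; keep='last' would keep the other row
df = df.drop_duplicates(subset='ID', keep='first')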

1 Answer:

Answer 0 (score: 0)

You can use dropDuplicates().

Sample data in a DataFrame

>>> cols = ['ID', 'Date']

>>> vals = [
        ('213412', '2008-10-26T06:04:00.000Z'),
        ('213412', '2018-10-26T05:42:00.000Z'),
        ('393859', '2018-10-26T09:17:00.000Z'),
    ]

# Create DataFrame
>>> df = spark.createDataFrame(vals, cols)
>>> df.show(3, False)

+--------+------------------------+
|ID      |Date                    |
+--------+------------------------+
|213412  |2008-10-26T06:04:00.000Z|
|213412  |2018-10-26T05:42:00.000Z|
|393859  |2018-10-26T09:17:00.000Z|
+--------+------------------------+

Using dropDuplicates()

# Calling df.dropDuplicates() with no arguments removes only rows that are
# identical across all columns. Passing ["ID"] tells Spark to deduplicate
# on that column alone, which is what we need here since the Dates differ.

df_dist = df.dropDuplicates(["ID"])
df_dist.show(2, False)

+--------+------------------------+
|ID      |Date                    |
+--------+------------------------+
|213412  |2008-10-26T06:04:00.000Z|
|393859  |2018-10-26T09:17:00.000Z|
+--------+------------------------+
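Note that dropDuplicates() makes no guarantee about which duplicate row survives. If you later need a deterministic choice (for example, keeping the newest Date per ID), the standard window-function pattern works; this is a sketch beyond the original answer, using row_number() over a window partitioned by ID:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Rank rows within each ID, newest Date first.
# ISO-8601 timestamp strings sort chronologically, so string ordering is safe here.
w = Window.partitionBy("ID").orderBy(F.col("Date").desc())

# Keep only the top-ranked row per ID, then drop the helper column
df_latest = (df.withColumn("rn", F.row_number().over(w))
               .filter(F.col("rn") == 1)
               .drop("rn"))
df_latest.show(2, False)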

For more information, see the PySpark documentation for dropDuplicates().