I have the following sample dataset:
ID Date
213412 2008-10-26T06:04:00.000Z
213412 2018-10-26T05:42:00.000Z
393859 2018-10-26T09:17:00.000Z
Two of the rows above share the same ID value. I want to keep only one of the two rows for ID 213412; it doesn't matter which one I keep.
I know how to do this in Pandas, but not in PySpark.
Answer 0 (Score: 0)
You can use dropDuplicates():
>>> cols = ['ID', 'Date']
>>> vals = [
...     ('213412', '2008-10-26T06:04:00.000Z'),
...     ('213412', '2018-10-26T05:42:00.000Z'),
...     ('393859', '2018-10-26T09:17:00.000Z'),
... ]
# Create DataFrame
>>> df = spark.createDataFrame(vals, cols)
>>> df.show(3, False)
+------+------------------------+
|ID    |Date                    |
+------+------------------------+
|213412|2008-10-26T06:04:00.000Z|
|213412|2018-10-26T05:42:00.000Z|
|393859|2018-10-26T09:17:00.000Z|
+------+------------------------+
# df.dropDuplicates() with no arguments drops only rows that are identical in every
# column; passing the column name ("ID") tells Spark to deduplicate on that column alone.
>>> df_dist = df.dropDuplicates(["ID"])
>>> df_dist.show(2, False)
+------+------------------------+
|ID    |Date                    |
+------+------------------------+
|213412|2008-10-26T06:04:00.000Z|
|393859|2018-10-26T09:17:00.000Z|
+------+------------------------+
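Note that dropDuplicates() makes no guarantee about which of the duplicate rows survives. If you ever do need that control (say, always keeping the most recent Date per ID), a common alternative is a window function with row_number(). The sketch below assumes the df created above; the names w and df_latest are just illustrative.
>>> from pyspark.sql import Window
>>> from pyspark.sql.functions import col, row_number
# Rank rows within each ID, newest Date first; ordering the ISO-8601 strings
# lexicographically matches chronological order, so no timestamp cast is needed here.
>>> w = Window.partitionBy("ID").orderBy(col("Date").desc())
# Keep only the top-ranked row per ID, then drop the helper column.
>>> df_latest = df.withColumn("rn", row_number().over(w)) \
...               .filter(col("rn") == 1) \
...               .drop("rn")
>>> df_latest.show(2, False)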
For more information, see the dropDuplicates() entry in the PySpark DataFrame API documentation.