Question

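I have a problem similar to this one, but I want to check for duplicates across multiple columns and keep the record with the oldest timestamp.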

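I tried the following approach: create a timestamp column, order by it, and then drop duplicates (dropping duplicates keeps the first record and removes the later ones). This works.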

from pyspark.sql.functions import unix_timestamp
...

pattern = "yyyy-MM-dd hh:mm:ss"  # note: "hh" is the 12-hour clock; "HH" is 24-hour

# valid_from is the timestamp column.
# Parse it into a timestamp and sort ascending.
df = df.withColumn(
    "timestampCol",
    unix_timestamp(df["valid_from"], pattern).cast("timestamp"),
).orderBy(["timestampCol"], ascending=True)

# Drop duplicates based on all columns except the timestamps,
# so that only the oldest timestamps remain.
df = df.dropDuplicates(
    subset=[x for x in df.columns if x not in ["valid_from", "timestampCol"]]
)

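This code works fine for small datasets, but with a larger dataset I run into severe performance problems. I found that dropDuplicates() after orderBy() performs terribly. I tried caching the DataFrame, but that did not help much.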

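The problem is that once the duplicate dropping starts, I see this on the console:

[Stage x:=============================>                    (1 + 1) / 2]

It stays stuck there for almost 20 minutes.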

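So my questions are:

Why does dropDuplicates() after orderBy() perform so badly? Is there another way to achieve the same goal (dropping duplicates on multiple columns while keeping the oldest value)?
Does the console output mean that only 2 executors were running at that point? If so, how can I increase them? I submit my application on YARN with: --num-executors 5 --executor-cores 5 --executor-memory 20G. Why do I have only two executors running at this particular point, and how can I increase them for this step?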

Spark drop duplicates on multiple columns - performance issues
