I am reading a Parquet file:
val df1 = sqlContext.read.parquet("")
// df1.count returns 5000000 records
val df2 = df1.select("cid", "aid", "DId", "dd", "mm", "yy", "TO")
  .distinct()
  .groupBy("cid", "aid", "DId", "dd", "mm", "yy", "TO")
  .count()
  .filter($"count" > 1)
  .select("cid", "aid", "DId", "dd", "mm", "yy")
val df3 = df2.join(df1,
    df2("cid") <=> df1("cid") &&
    df2("aid") <=> df1("aid") &&
    df2("did") <=> df1("did") &&
    df2("dd") <=> df1("dd") &&
    df2("mm") <=> df1("mm") &&
    df2("yy") <=> df1("yy"), "inner")
  .distinct()
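When I execute df3.count, it takes nearly 897517 ms, which is far too long. It slows down the whole job, and sometimes the job is aborted. If I increase the timeout option the job completes, but the latency is still too high. I need suggestions for improving this.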
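Answer 0 (score: 0)
Based on your comment:
My intention is similar to the SQL query SELECT cid, aid, DId, dd, mm, yy FROM (SELECT DISTINCT cid, aid, DId, dd, mm, yy, TO FROM nsc WHERE cid = 1) t GROUP BY cid, aid, DId, dd, mm, yy HAVING COUNT(*) > 1; I converted it to a DataFrame.
I think the correct code should be: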
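import org.apache.spark.sql.functions.col

// df is the DataFrame read from the Parquet file (df1 in the question)
df
  .select("cid", "aid", "DId", "dd", "mm", "yy", "TO")
  .distinct()                  // SELECT DISTINCT ... (one row per key and TO value)
  .filter(col("cid") <=> 1)    // WHERE cid = 1
  .groupBy("cid", "aid", "DId", "dd", "mm", "yy")
  .count()
  .filter(col("count") > 1)    // HAVING COUNT(*) > 1
  .select("cid", "aid", "DId", "dd", "mm", "yy")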
So there is no need to do the final join. It should be faster.
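To make it clearer what the distinct / group / count > 1 pipeline actually selects, here is a minimal sketch of the same logic on plain Scala collections rather than Spark. The `Record` case class and the sample rows are made up for illustration and are not from your data; the point is only that a key survives when it occurs with more than one distinct `TO` value.

```scala
// Hypothetical record type mirroring the columns in the question (illustration only).
case class Record(cid: Int, aid: Int, did: Int, dd: Int, mm: Int, yy: Int, to: String)

// Made-up sample rows, not real data.
val rows = Seq(
  Record(1, 10, 100, 1, 1, 2017, "a"),
  Record(1, 10, 100, 1, 1, 2017, "a"), // exact duplicate, removed by distinct
  Record(1, 10, 100, 1, 1, 2017, "b"), // same key, different TO
  Record(1, 20, 200, 2, 1, 2017, "a"), // key with a single TO value
  Record(2, 10, 100, 1, 1, 2017, "a"),
  Record(2, 10, 100, 1, 1, 2017, "b")  // cid = 2, dropped by the filter
)

val duplicatedKeys = rows
  .distinct                                               // SELECT DISTINCT ...
  .filter(_.cid == 1)                                     // WHERE cid = 1
  .groupBy(r => (r.cid, r.aid, r.did, r.dd, r.mm, r.yy))  // GROUP BY the key columns
  .collect { case (key, group) if group.size > 1 => key } // HAVING COUNT(*) > 1
  .toList
```

With these sample rows only the key (1, 10, 100, 1, 1, 2017) remains, because it is the only key with `cid = 1` that appears with two different `TO` values. The result already contains every column the final select asks for, which is why joining back to the original data adds nothing.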