I have a Spark 1.6 DataFrame (3 columns) containing records in the following format -
----------
w p s
----------
w1 p1 0
w1 p1 1
w1 p1 2
w1 p2 0
w1 p2 1
w2 p1 0
w2 p1 1
w2 p3 0
w2 p3 1
w2 p3 2
w2 p4 0
w1 p4 1
w3 p1 0
w3 p1 1
w3 p2 0
w3 p3 0
w3 p4 0
w4 p1 0
w4 p1 1
// Output updated to match the updated input.
Next, I want to transform it so that only the rows whose column 2 and column 3 values (the (p, s) combination) appear for every w are kept, eliminating all records that are not common to all w, as shown below -
----------
w p s
----------
w1 p1 0
w1 p1 1
w2 p1 0
w2 p1 1
w3 p1 0
w3 p1 1
w4 p1 0
w4 p1 1
I tried a self-join, but it returns all the records of the input DataFrame. The other approach I can think of is to get the distinct combinations of w and p and join them back to the input DataFrame, and then get the distinct combinations of p and s and join those back to the input DataFrame as well (a rough sketch of this idea follows).
Can anyone suggest a better way to get the desired output?
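For illustration, here is a rough sketch of one way to make that idea concrete (untested; it assumes countDistinct and the multi-column join available in Spark 1.6, and totalW, commonPS and result are placeholder names of mine) -

import org.apache.spark.sql.functions.countDistinct
import sqlContext.implicits._  // assuming a SQLContext named sqlContext is in scope

// total number of distinct w values in the input
val totalW = df.select("w").distinct().count()

// (p, s) combinations that occur for every distinct w
val commonPS = df.groupBy("p", "s")
  .agg(countDistinct("w").alias("wCount"))
  .filter($"wCount" === totalW)
  .select("p", "s")

// keep only the rows whose (p, s) combination is common to all w
val result = df.join(commonPS, Seq("p", "s")).select("w", "p", "s")

The join on Seq("p", "s") keeps a row only when its (p, s) pair survived the count filter, which is exactly the elimination described above.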
// Update - using a self-join. I used the following self-join query -
df.registerTempTable("t")
sqlContext.sql("select distinct t1.w,t1.p,t1.s FROM t AS t1 JOIN t AS t2 ON t1.p = t2.p and t1.s = t2.s where t1.w != t2.w")
which produces the output below; p2, p3 and p4 do not appear across all w, so they should not be in the output -
w1 p1 0
w1 p1 1
w2 p1 0
w2 p1 1
w3 p1 0
w3 p1 1
w4 p1 0
w4 p1 1
w1 p2 0
w2 p3 0
w2 p4 0
w3 p2 0
w3 p3 0
w3 p4 0
// Update using window functions. I have not used window functions before, so I tried the simple queries below, but I don't know how to get from them to the desired result; I feel I'm close but can't tell what's missing (a rough sketch follows the queries) -
val df1 = sqlContext.sql("select w,p,s, row_number() over ( order by p,s) as rn, rank() over ( order by p,s) as rk, dense_rank() over ( order by p,s) as dr from t")
val df2 = sqlContext.sql("select w,p,s, row_number() over (partition by p order by p,s) as rn, rank() over (partition by p order by p,s) as rk, dense_rank() over (partition by p order by p,s) as dr from t")
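For reference, a rough sketch of how the window-function attempt could be extended (untested; COUNT(DISTINCT ...) OVER is not supported in Spark SQL, so dense_rank over (partition by p, s order by w) is used to emulate the per-pair distinct-w count, and the overall distinct-w count is computed separately and interpolated) -

// overall number of distinct w values
val distinctWCount = sqlContext.sql("select count(distinct w) as c from t").first().getLong(0)

// the maximum dense_rank within each (p, s) partition equals the number of distinct w for that pair
val common = sqlContext.sql(s"""
  select w, p, s from (
    select w, p, s, max(dr) over (partition by p, s) as wPerPs
    from (select w, p, s, dense_rank() over (partition by p, s order by w) as dr from t) x
  ) y
  where wPerPs = $distinctWCount""")

common.show(false)

Keeping only the (p, s) partitions whose distinct-w count equals the overall count should leave exactly the common rows.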
Any help is appreciated.
Answer 0 (score: 2)
This answer does not require a join. Comments are included in the code for clarity.
import sqlContext.implicits._  // needed for $ and toDF (assuming a SQLContext named sqlContext)

// collect the distinct w values to check the output against
val distinctListOfW = df.select("w").distinct().collect.map(row => row.getAs[String](0))

// select p and s of any one w (the first here) to check for (p, s) repetition across all the w collected above
val firstW = distinctListOfW(0)
val p_and_s_of_first_w = df.filter($"w" === firstW).select("p", "s").collect().map(row => (row.getAs[String](0), row.getAs[String](1)))

// empty (seed) dataframe for merging the subsequent dataframes that match the logic
val emptyDF = Seq(("temp", "temp", "temp")).toDF("w", "p", "s")

// fold left over the (p, s) pairs of that one w: if the distinct w count for a pair matches all distinct w, merge it, else keep the previous df
// (union and =!= are Spark 2.x API; on Spark 1.6 use unionAll and !== instead)
val finaldf = p_and_s_of_first_w.foldLeft(emptyDF) { (tempdf, ps) =>
  val filtered = df.filter($"p" === ps._1 && $"s" === ps._2)
  val tempdistinctListOfW = filtered.select("w").distinct().collect.map(row => row.getAs[String](0))
  if (filtered.count() > 0 && tempdistinctListOfW.length == distinctListOfW.length) {
    tempdf.union(filtered)
  } else {
    tempdf
  }
}.filter($"w" =!= "temp")
// this is the final required result
finaldf.show(false)
which should give you
+---+---+---+
|w |p |s |
+---+---+---+
|w1 |p1 |0 |
|w2 |p1 |0 |
|w3 |p1 |0 |
|w4 |p1 |0 |
|w1 |p1 |1 |
|w2 |p1 |1 |
|w3 |p1 |1 |
|w4 |p1 |1 |
+---+---+---+
Answer 1 (score: 1)
Shouldn't a simple SELF-JOIN like this work?
from pyspark.sql.functions import col

(df.alias("df1").select(col("w").alias("w1"), col("p"), col("s"))
   .join(df.alias("df2").select(col("w").alias("w2"), col("p"), col("s")), ["p", "s"])
   .filter(col("w1") != col("w2")))