我创建了两个数据框,如下所示。
df = spark.createDataFrame(
[(1, 1, 2,4), (1, 2, 9,5), (2, 1, 2,1), (2, 2, 1,2), (4, 1, 5,2), (4, 2, 6,3), (5, 1, 3,3), (5, 2, 8,4)],
("sid", "cid", "Cr","rank"))
df1 = spark.createDataFrame(
[[1,1],[1,2],[1,3], [2,1],[2,2],[2,3],[4,1],[4,2],[4,3],[5,2],[5,3],[5,3],[3,4]],
["sid","cid"])
因为一些要求我创建了sqlContext并创建了临时视图,如下所示。
df.createOrReplaceTempView("temp")
df2=sqlContext.sql("select sid,cid,cr,rank from temp")
然后我根据某些条件做左联。
joined = (df2.alias("df")
.join(
df1.alias("df1"),
(col("df.sid") == col("df1.sid")) & (col("df.cid") == col("df1.cid")),
"left"))
joined.show()
+---+---+---+----+----+----+
|sid|cid| cr|rank| sid| cid|
+---+---+---+----+----+----+
| 5| 1| 3| 3|null|null|
| 1| 1| 2| 4| 1| 1|
| 4| 2| 6| 3| 4| 2|
| 5| 2| 8| 4| 5| 2|
| 2| 2| 1| 2| 2| 2|
| 4| 1| 5| 2| 4| 1|
| 1| 2| 9| 5| 1| 2|
| 2| 1| 2| 1| 2| 1|
+---+---+---+----+----+----+
然后我最终执行下面的代码:
final=joined.select(
col("df2.*"),
col("df1.sid").isNull().cast("integer").alias("flag")
).orderBy("sid", "cid")
然后我收到如下错误。
"AnalysisException: "cannot resolve 'df2.*' give input columns 'cr, sid, sid, cid, cid, rank';"
但我的预期出局应该是:
+---+---+---+----+----+
|sid|cid| Cr|rank|flag|
+---+---+---+----+----+
| 1| 1| 2| 4| 0|
| 1| 2| 9| 5| 0|
| 2| 1| 2| 1| 0|
| 2| 2| 1| 2| 0|
| 4| 1| 5| 2| 0|
| 4| 2| 6| 3| 0|
| 5| 1| 3| 3| 1|
| 5| 2| 8| 4| 0|
+---+---+---+----+----+
答案 0 :(得分:0)
错误是:
joined = (df2.alias("df")
.join(
df1.alias("df1"),
(col("df2.sid") == col("df1.sid")) & (col("df2.cid") == col("df1.cid")),
"left"))
joined.show()
这里我们应该使用df2.alias("df2") or joined.select(col("df.*")..)
完整的解决方案是:
joined = (df2.alias("df2")
.join(
df1.alias("df1"),
(col("df2.sid") == col("df1.sid")) & (col("df2.cid") == col("df1.cid")),
"left"))
joined.show()
final=joined.select(
col("df2.*"),
col("df1.sid").isNull().cast("integer").alias("flag")
).orderBy("sid", "cid")