AnalysisException: "cannot resolve 'df2.*' given input columns" in PySpark?

Asked: 2018-02-01 14:34:17

Tags: python apache-spark pyspark apache-spark-sql pyspark-sql

I created two DataFrames as shown below.

df = spark.createDataFrame(
    [(1, 1, 2, 4), (1, 2, 9, 5), (2, 1, 2, 1), (2, 2, 1, 2),
     (4, 1, 5, 2), (4, 2, 6, 3), (5, 1, 3, 3), (5, 2, 8, 4)],
    ("sid", "cid", "Cr", "rank"))
df1 = spark.createDataFrame(
    [[1, 1], [1, 2], [1, 3], [2, 1], [2, 2], [2, 3], [4, 1],
     [4, 2], [4, 3], [5, 2], [5, 3], [5, 3], [3, 4]],
    ["sid", "cid"])

Because of some requirements, I created a sqlContext and a temporary view, as shown below.

df.createOrReplaceTempView("temp")

df2 = sqlContext.sql("select sid,cid,cr,rank from temp")

Then I do a left join based on certain conditions.

from pyspark.sql.functions import col  # needed for col()

joined = (df2.alias("df")
    .join(
        df1.alias("df1"),
        (col("df.sid") == col("df1.sid")) & (col("df.cid") == col("df1.cid")),
        "left"))
joined.show()

+---+---+---+----+----+----+
|sid|cid| cr|rank| sid| cid|
+---+---+---+----+----+----+
|  5|  1|  3|   3|null|null|
|  1|  1|  2|   4|   1|   1|
|  4|  2|  6|   3|   4|   2|
|  5|  2|  8|   4|   5|   2|
|  2|  2|  1|   2|   2|   2|
|  4|  1|  5|   2|   4|   1|
|  1|  2|  9|   5|   1|   2|
|  2|  1|  2|   1|   2|   1|
+---+---+---+----+----+----+

Then I finally execute the code below:

final=joined.select(
    col("df2.*"),
    col("df1.sid").isNull().cast("integer").alias("flag")
).orderBy("sid", "cid")

Then I get the following error:

"AnalysisException: "cannot resolve 'df2.*' given input columns 'cr, sid, sid, cid, cid, rank';"

But my expected output should be:

+---+---+---+----+----+
|sid|cid| Cr|rank|flag|
+---+---+---+----+----+
|  1|  1|  2|   4|   0|
|  1|  2|  9|   5|   0|
|  2|  1|  2|   1|   0|
|  2|  2|  1|   2|   0|
|  4|  1|  5|   2|   0|
|  4|  2|  6|   3|   0|
|  5|  1|  3|   3|   1|
|  5|  2|  8|   4|   0|
+---+---+---+----+----+ 

1 Answer:

Answer 0 (score: 0)

The mistake is here:

joined = (df2.alias("df")
    .join(
        df1.alias("df1"),
        (col("df2.sid") == col("df1.sid")) & (col("df2.cid") == col("df1.cid")),
        "left"))
joined.show()

The alias and the columns you reference must match: either alias the DataFrame as df2 with df2.alias("df2"), or keep the alias "df" and write joined.select(col("df.*"), ...) instead.

The complete solution is:

joined = (df2.alias("df2")
    .join(
        df1.alias("df1"),
        (col("df2.sid") == col("df1.sid")) & (col("df2.cid") == col("df1.cid")),
        "left"))
joined.show()

final=joined.select(
    col("df2.*"),
    col("df1.sid").isNull().cast("integer").alias("flag")
).orderBy("sid", "cid")
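The left-join-plus-flag logic itself is independent of Spark. As a minimal sketch using the question's sample data (plain Python, no Spark session required), the flag is 1 exactly when a (sid, cid) pair from df has no match in df1:

```python
# Sample rows mirroring the question's df (sid, cid, cr, rank) and df1 (sid, cid).
df_rows = [(1, 1, 2, 4), (1, 2, 9, 5), (2, 1, 2, 1), (2, 2, 1, 2),
           (4, 1, 5, 2), (4, 2, 6, 3), (5, 1, 3, 3), (5, 2, 8, 4)]
df1_keys = {(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3),
            (4, 1), (4, 2), (4, 3), (5, 2), (5, 3), (3, 4)}

# Left join on (sid, cid): flag = 1 when there is no match in df1, else 0;
# then order by sid, cid, as in the final select.
final_rows = sorted(
    (sid, cid, cr, rank, 0 if (sid, cid) in df1_keys else 1)
    for sid, cid, cr, rank in df_rows
)
for row in final_rows:
    print(row)
```

Only (5, 1) is missing from df1, so it is the only row with flag = 1, matching the expected output above.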