PySpark join expression error when using attribute-style column access on a DataFrame

Asked: 2019-11-18 01:43:31

Tags: apache-spark pyspark apache-spark-sql

I have two DataFrames that I want to join. Their schemas look like this:

itemsDF.printSchema()

root
 |-- asin: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- title: string (nullable = true)
 |-- url: string (nullable = true)
 |-- image: string (nullable = true)
 |-- rating: float (nullable = true)
 |-- reviewUrl: string (nullable = true)
 |-- totalReviews: integer (nullable = true)

reviewsDF.printSchema()

root
 |-- asin: string (nullable = true)
 |-- name: string (nullable = true)
 |-- rating: float (nullable = true)
 |-- date: date (nullable = true)
 |-- verified: boolean (nullable = true)
 |-- title: string (nullable = true)
 |-- helpfulVotes: float (nullable = true)
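
Roughly how the two DataFrames are built (a sketch only: the file paths and exact read steps are placeholders, while the CSV sources and the casts mirror what the query plan in the error further down shows):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("items-reviews-join").getOrCreate()

# Placeholder paths -- the plan in the error only shows that both relations are CSV.
itemsDF = (
    spark.read.option("header", "true").csv("items.csv")
    .withColumn("rating", F.col("rating").cast("float"))
    .withColumn("totalReviews", F.col("totalReviews").cast("int"))
    .drop("prices")          # the printed schema has no prices column
)

reviewsDF = (
    spark.read.option("header", "true").csv("reviews.csv")
    .withColumn("rating", F.col("rating").cast("float"))
    .withColumn("date", F.to_date(F.col("date"), "MMMM dd, yyyy"))
    .withColumn("verified", F.col("verified").cast("boolean"))
    .withColumn("helpfulVotes", F.col("helpfulVotes").cast("float"))
    .drop("body")            # the printed schema has no body column
)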

I want to join the two DataFrames on the asin column. This expression seems to work fine:

reviewsDF.join(itemsDF, reviewsDF['asin'] == itemsDF['asin']).show()
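
(For comparison only, not part of the original attempt: the same equi-join can also be written by passing the column name as a string, which performs an inner join on asin and keeps a single asin column in the result.)

reviewsDF.join(itemsDF, on='asin').show()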

However, the following expression gives an error:

reviewsDF.join(itemsDF, reviewsDF.asin == itemsDF.asin).show()
AnalysisException: Detected implicit cartesian product for INNER join between logical plans
Project [asin#347, name#348, cast(rating#349 as float) AS rating#459, cast(cast(unix_timestamp(date#350, MMMM dd, yyyy, Some(America/Los_Angeles)) as timestamp) as date) AS date#458, cast(verified#351 as boolean) AS verified#460, title#352, cast(helpfulVotes#354 as float) AS helpfulVotes#461]
+- Relation[asin#347,name#348,rating#349,date#350,verified#351,title#352,body#353,helpfulVotes#354] csv
and
Project [asin#846, brand#847, title#848, url#849, image#850, cast(rating#851 as float) AS rating#864, reviewUrl#852, cast(totalReviews#853 as int) AS totalReviews#865]
+- Filter (isnotnull(asin#846) && (asin#846 = asin#846))
   +- Relation[asin#846,brand#847,title#848,url#849,image#850,rating#851,reviewUrl#852,totalReviews#853,prices#854] csv
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
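
(The error itself points at two escape hatches: the CROSS JOIN syntax or the spark.sql.crossJoin.enabled configuration. A sketch of the latter is shown below for completeness; judging from the plan above, the condition has already collapsed to the trivial asin#846 = asin#846, so this would only let the query run as a cartesian product rather than restore the intended equi-join.)

# Sketch only: suppresses the check, so the plan above executes as a cartesian product.
spark.conf.set("spark.sql.crossJoin.enabled", "true")
reviewsDF.join(itemsDF, reviewsDF.asin == itemsDF.asin).show()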

Why does the second expression fail?

0 Answers:

There are no answers yet.