I have two DataFrames that I want to join. Their schemas are as follows:
itemsDF.printSchema()
root
|-- asin: string (nullable = true)
|-- brand: string (nullable = true)
|-- title: string (nullable = true)
|-- url: string (nullable = true)
|-- image: string (nullable = true)
|-- rating: float (nullable = true)
|-- reviewUrl: string (nullable = true)
|-- totalReviews: integer (nullable = true)
reviewsDF.printSchema()
root
|-- asin: string (nullable = true)
|-- name: string (nullable = true)
|-- rating: float (nullable = true)
|-- date: date (nullable = true)
|-- verified: boolean (nullable = true)
|-- title: string (nullable = true)
|-- helpfulVotes: float (nullable = true)
I want to join the two DataFrames on the asin column. This expression seems to work fine:
reviewsDF.join(itemsDF, reviewsDF['asin'] == itemsDF['asin']).show()
However, the following expression raises an error:
reviewsDF.join(itemsDF, reviewsDF.asin == itemsDF.asin).show()
AnalysisException: 'Detected implicit cartesian product for INNER join between logical plans\nProject [asin#347, name#348, cast(rating#349 as float) AS rating#459, cast(cast(unix_timestamp(date#350, MMMM dd, yyyy, Some(America/Los_Angeles)) as timestamp) as date) AS date#458, cast(verified#351 as boolean) AS verified#460, title#352, cast(helpfulVotes#354 as float) AS helpfulVotes#461]\n+- Relation[asin#347,name#348,rating#349,date#350,verified#351,title#352,body#353,helpfulVotes#354] csv\nand\nProject [asin#846, brand#847, title#848, url#849, image#850, cast(rating#851 as float) AS rating#864, reviewUrl#852, cast(totalReviews#853 as int) AS totalReviews#865]\n+- Filter (isnotnull(asin#846) && (asin#846 = asin#846))\n +- Relation[asin#846,brand#847,title#848,url#849,image#850,rating#851,reviewUrl#852,totalReviews#853,prices#854] csv\nJoin condition is missing or trivial.\nEither: use the CROSS JOIN syntax to allow cartesian products between these\nrelations, or: enable implicit cartesian products by setting the configuration\nvariable spark.sql.crossJoin.enabled=true;'
Why does the second expression fail?