I've been battling with this in Scala for a while and can't seem to find a clear solution.
I have two DataFrames:
val Companies = Seq(
(8, "Yahoo"),
(-5, "Google"),
(12, "Microsoft"),
(-10, "Uber")
).toDF("movement", "Company")
val LookUpTable = Seq(
("B", "Buy"),
("S", "Sell")
).toDF("Code", "Description")
I need to create a column on Companies that will let me join to the lookup table. It's a simple case statement that checks whether movement is negative, in which case Sell, otherwise Buy. I then need to join to the lookup table on this newly created column.
val joined = Companies.as("Companies")
.withColumn("Code",expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END"))
.join(LookUpTable.as("LookUpTable"), $"LookUpTable.Code" === $"Code", "left_outer")
However, I keep getting the following error:
org.apache.spark.sql.AnalysisException: Reference 'Code' is ambiguous, could be: Code, LookUpTable.Code.;
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:101)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$40.apply(Analyzer.scala:888)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$40.apply(Analyzer.scala:890)
at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve(Analyzer.scala:887)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve$2.apply(Analyzer.scala:896)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve$2.apply(Analyzer.scala:896)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve(Analyzer.scala:896)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$35.apply(Analyzer.scala:956)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$35.apply(Analyzer.scala:956)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
I tried adding an alias for Code, but that did not work either:
val joined = Companies.as("Companies")
.withColumn("Code",expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END"))
.join(LookUpTable.as("LookUpTable"), $"LookUpTable.Code" === $"Companies.Code", "left_outer")
org.apache.spark.sql.AnalysisException: cannot resolve '`Companies.Code`' given input columns: [Code, LookUpTable.Code, LookUpTable.Description, Companies.Company, Companies.movement];;
'Join LeftOuter, (Code#102625 = 'Companies.Code)
:- Project [movement#102616, Company#102617, CASE WHEN (movement#102616 > 0) THEN B ELSE S END AS Code#102629]
: +- SubqueryAlias `Companies`
: +- Project [_1#102613 AS movement#102616, _2#102614 AS Company#102617]
: +- LocalRelation [_1#102613, _2#102614]
+- SubqueryAlias `LookUpTable`
+- Project [_1#102622 AS Code#102625, _2#102623 AS Description#102626]
+- LocalRelation [_1#102622, _2#102623]
The only workaround I've found is to alias the newly created column, but that then creates an additional column, which doesn't feel right.
val joined = Companies.as("Companies")
.withColumn("_Code",expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END")).as("Code")
.join(LookUpTable.as("LookUpTable"), $"LookUpTable.Code" === $"Code", "left_outer")
joined.show()
+--------+---------+-----+----+-----------+
|movement| Company|_Code|Code|Description|
+--------+---------+-----+----+-----------+
| 8| Yahoo| B| B| Buy|
| 8| Yahoo| B| S| Sell|
| -5| Google| S| B| Buy|
| -5| Google| S| S| Sell|
| 12|Microsoft| B| B| Buy|
| 12|Microsoft| B| S| Sell|
| -10| Uber| S| B| Buy|
| -10| Uber| S| S| Sell|
+--------+---------+-----+----+-----------+
Is there a way to join on the newly created column without having to create a new DataFrame or a new column via an alias?
Answer 0 (score: 0)
You have to use an alias if you need columns from two different dataframes that have the same name. This is because the Spark DataFrame API creates a schema for the DataFrame in question, and in a given schema you can never have two or more columns with the same name.
This is also why, in SQL, you can query two tables with like-named columns without aliases, but a SELECT that references the ambiguous name throws a similar error.
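Following the same schema reasoning, one alias-free alternative (a minimal sketch, not from the original answer, assuming the question's Companies and LookUpTable plus the usual import spark.implicits._ and org.apache.spark.sql.functions.expr) is to rename the lookup column so the two sides never share a name:
// Rename the lookup-side column so no two columns in the joined schema clash.
// "LookUpCode" is a hypothetical name chosen for this sketch.
val lookup = LookUpTable.withColumnRenamed("Code", "LookUpCode")
val joined = Companies
  .withColumn("Code", expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END"))
  .join(lookup, $"Code" === $"LookUpCode", "left_outer")
  .drop("LookUpCode") // discard the renamed copy once the join is done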
Answer 1 (score: 0)
Have you tried using Seq in the Spark DataFrame join?
1. Using Seq, with no duplicate columns
val joined = Companies.as("Companies")
.withColumn("Code",expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END"))
.join(LookUpTable.as("LookUpTable"), Seq("Code"), "left_outer")
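With Seq("Code"), Spark performs a USING-style equi-join on the like-named column, so the result keeps a single Code column and nothing needs qualifying downstream.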
2. Alias after withColumn, but it will generate duplicate columns
val joined = Companies.withColumn("Code",expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END")).as("Companies")
.join(LookUpTable.as("LookUpTable"), $"LookUpTable.Code" === $"Companies.Code", "left_outer")
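This second form works because the alias is applied after withColumn, so the derived Code column sits inside the Companies alias scope and $"Companies.Code" resolves. If the duplicate column is unwanted, a small follow-up sketch (not part of the original answer, assuming the joined DataFrame above) drops the lookup-side copy:
// The Column form of drop resolves the qualified name despite the duplicate.
val deduped = joined.drop($"LookUpTable.Code")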
Answer 2 (score: 0)
An expression can be used for the join:
val codeExpression = expr("CASE WHEN movement > 0 THEN 'B' ELSE 'S' END")
val joined = Companies.as("Companies")
.join(LookUpTable.as("LookUpTable"), $"LookUpTable.Code" === codeExpression, "left_outer")
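Note that here the CASE expression lives only in the join condition, so no extra column is materialized on the Companies side; for matched rows, the value surfaces through LookUpTable.Code instead. A usage sketch, assuming the joined DataFrame above:
// Shape the output: all Companies columns plus the matched code and description.
joined.select($"Companies.*", $"LookUpTable.Code", $"LookUpTable.Description").show()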