我使用以下组件
这基本上是我的代码
Dataset<Row> rowsSource = spark.read()
.option("header", "true")
.option("delimiter", ";")
.csv("source.csv");
Dataset<Row> rowsTarget = spark.read()
.option("header", "true")
.option("delimiter", ";")
.csv("target.csv");
rowsSource.createOrReplaceTempView("source");
rowsTarget.createOrReplaceTempView("target");
Dataset<Row> result = spark.sql("SELECT source.Id FROM source" +
" LEFT OUTER JOIN target USING(Id)" +
" WHERE target.Id IS NULL");
result.show();
这里有一些测试数据:
源:
&#34;标识&#34 ;;&#34;状态&#34;
&#34; 1&#34 ;;&#34; ERROR&#34;
&#34; 2&#34 ;;&#34; OK&#34;
目标:
&#34;标识&#34 ;;&#34;状态&#34;
&#34; 2&#34 ;;&#34; OK&#34;
我希望,SQL语句只找到一个Id,那就是&#34; 1&#34;
但如果我运行它,并且在执行SQL语句的行中发生异常
2017-03-21 17:00:09,693 INFO [main] com.materna.mobility.smart.selenium.Aaaa: starting
Exception in thread "main" org.apache.spark.sql.AnalysisException: USING column `Detail` cannot be resolved on the left side of the join. The left-side columns: ["Detail", Detailp, Detaild, Detailb, Amount - 2016 48 +0100/1, Amount - 2016 49 +0100/1, Amount - 2016 50 +0100/1, Amount - 2016 51 +0100/1, Amount - 2016 52 +0100/1];
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$90$$anonfun$apply$56.apply(Analyzer.scala:1977)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$90$$anonfun$apply$56.apply(Analyzer.scala:1977)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$90.apply(Analyzer.scala:1976)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$90.apply(Analyzer.scala:1975)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$commonNaturalJoinProcessing(Analyzer.scala:1975)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$$anonfun$apply$31.applyOrElse(Analyzer.scala:1961)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$$anonfun$apply$31.applyOrElse(Analyzer.scala:1958)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:58)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$.apply(Analyzer.scala:1958)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$.apply(Analyzer.scala:1957)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:64)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:62)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
at MyClass.main(MyClass.java:48)
如果我在Id之前插入一个额外的分号(;),一切都按预期工作,这是一个例子:
;&#34;标识&#34 ;;&#34;状态&#34;
我认为Spark会解析3列,但由于第一列无效,因此会被忽略。
答案 0 :(得分:1)
我的CSV文件包含了BOM(字节顺序标记)(我刚发现)
字节顺序标记(BOM)是一个Unicode字符,U + FEFF字节顺序标记(BOM),其外观为文本流开头的幻数,可以向消费文本的程序发出几个信号
在做了一些搜索之后我发现了这个问题: https://github.com/databricks/spark-csv/issues/142
显然这是自2015年以来的一个问题
最简单的方法是从文件中删除BOM。
我发现的另一个修复程序(请参阅上面的问题)是,您可以在第一个列名称前添加一个附加分号。显然它然后再解析一列,但第一列无效并被忽略。但是:我强烈建议你不要使用它,因为将来可能会修复它并且上述解决方案更可靠
维基百科指出,如果我使用UTF-8(我做过),我应该期待我文件前面的以下字符(十六进制)&#34; EF BB BF&#34;
在这里你可以看到我期望的CSV文件应该是什么样的(因为我还没知道他们还有BOM),但实际上它们看起来如何
由于我缺乏声誉,我无法发布内容的图片,但在这里你去了: