Apache Spark inserts quotes in the first column

Date: 2017-03-21 16:20:10

Tags: sql apache-spark apache-spark-sql

My setup

I am using the following components:

  • spark-core_2.10
  • spark-sql_2.10

My problem

This is essentially my code:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// "spark" is an already initialized SparkSession

// Read both CSV files: semicolon-delimited, with a header row
Dataset<Row> rowsSource = spark.read()
    .option("header", "true")
    .option("delimiter", ";")
    .csv("source.csv");

Dataset<Row> rowsTarget = spark.read()
    .option("header", "true")
    .option("delimiter", ";")
    .csv("target.csv");

// Register both datasets as temporary views so they can be queried via SQL
rowsSource.createOrReplaceTempView("source");
rowsTarget.createOrReplaceTempView("target");

// Select all Ids that exist in source but have no match in target
Dataset<Row> result = spark.sql("SELECT source.Id FROM source" +
                                " LEFT OUTER JOIN target USING(Id)" +
                                " WHERE target.Id IS NULL");

result.show();
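
As an aside, the same "in source but not in target" check can also be written directly with the DataFrame API's left anti join instead of SQL; a minimal sketch, assuming Spark 2.x (which the SparkSession in the stack trace below suggests):

// Left anti join: keeps only the rows of rowsSource whose Id has no match in rowsTarget
Dataset<Row> missing = rowsSource.join(
        rowsTarget,
        rowsSource.col("Id").equalTo(rowsTarget.col("Id")),
        "left_anti");

missing.select("Id").show();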

Here is some test data:

Source:

"Id";"Status"
"1";"ERROR"
"2";"OK"

Target:

"Id";"Status"
"2";"OK"

I expect the SQL statement to find exactly one Id, namely "1".

But when I run it, an exception is thrown on the line that executes the SQL statement:

2017-03-21 17:00:09,693 INFO  [main] com.materna.mobility.smart.selenium.Aaaa: starting
Exception in thread "main" org.apache.spark.sql.AnalysisException: USING column `Detail` cannot be resolved on the left side of the join. The left-side columns: ["Detail", Detailp, Detaild, Detailb, Amount - 2016 48 +0100/1, Amount - 2016 49 +0100/1, Amount - 2016 50 +0100/1, Amount - 2016 51 +0100/1, Amount - 2016 52 +0100/1];
    at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$90$$anonfun$apply$56.apply(Analyzer.scala:1977)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$90$$anonfun$apply$56.apply(Analyzer.scala:1977)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$90.apply(Analyzer.scala:1976)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$90.apply(Analyzer.scala:1975)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$commonNaturalJoinProcessing(Analyzer.scala:1975)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$$anonfun$apply$31.applyOrElse(Analyzer.scala:1961)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$$anonfun$apply$31.applyOrElse(Analyzer.scala:1958)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:58)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:331)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:58)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$.apply(Analyzer.scala:1958)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin$.apply(Analyzer.scala:1957)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
    at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
    at scala.collection.immutable.List.foldLeft(List.scala:84)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:64)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:62)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
    at MyClass.main(MyClass.java:48)

If I insert an extra semicolon (;) before Id, everything works as expected. Here is an example:

;"Id";"Status"

I assume Spark then parses three columns, but since the first column is invalid, it gets ignored.

1 Answer:

Answer 0 (score: 1)

The problem

My CSV files contained a BOM (byte order mark), as I just found out:

The byte order mark (BOM) is a Unicode character, U+FEFF BYTE ORDER MARK (BOM), whose appearance as a magic number at the start of a text stream can signal several things to a program consuming the text.

After some searching I found this issue: https://github.com/databricks/spark-csv/issues/142

Apparently it has been a known problem since 2015.

The fix

The simplest fix is to remove the BOM from the file.
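
A minimal sketch of how this could be done in Java (assuming UTF-8 files; the file name source.csv is just taken from the question):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class StripBom {
    public static void main(String[] args) throws IOException {
        // Rewrite the file without the UTF-8 BOM (the bytes EF BB BF), if present
        Path path = Paths.get("source.csv");
        byte[] bytes = Files.readAllBytes(path);
        if (bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF) {
            Files.write(path, Arrays.copyOfRange(bytes, 3, bytes.length));
        }
    }
}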

Another fix I found (see the issue linked above) is to add an extra semicolon before the first column name. Apparently Spark then parses one more column, but the first one is invalid and gets ignored. However: I strongly advise against this, since it may be fixed in the future, and the solution above is more reliable.

Visualization

Wikipedia states that for UTF-8 (which I was using) I should expect the following bytes (in hex) at the start of my files: "EF BB BF".
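
If you want to verify those bytes without a hex editor, a small sketch like the following prints them (again assuming a UTF-8 file named source.csv):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ShowFirstBytes {
    public static void main(String[] args) throws IOException {
        // Print the first three bytes of the file in hex; "EF BB BF" indicates a UTF-8 BOM
        byte[] bytes = Files.readAllBytes(Paths.get("source.csv"));
        for (int i = 0; i < Math.min(3, bytes.length); i++) {
            System.out.printf("%02X ", bytes[i] & 0xFF);
        }
        System.out.println();
    }
}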

Here you can see what I expected the CSV files to look like (since I did not yet know that they had a BOM) versus how they actually look.

Since I lack the reputation, I cannot post images of the content inline, but here you go: