Simple join of two Spark DataFrames fails with "org.apache.spark.sql.AnalysisException: Cannot resolve column name"

Date: 2015-09-02 14:47:00

Tags: csv apache-spark apache-spark-sql spark-dataframe

Update: It turns out this has to do with the way the Databricks Spark CSV reader creates the DataFrame. In the non-working example below, I read the people and address CSVs using the Databricks CSV reader, then wrote the resulting DataFrames to HDFS in Parquet format.
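For reference, the failing path looked roughly like this (a sketch, assuming the spark-csv 1.x API for Spark 1.4.1; the exact reader options are not shown in the original post):

    // Sketch of the non-working path: read the CSV with the Databricks
    // spark-csv reader, then persist the resulting DataFrame as Parquet on HDFS.
    DataFrame addressCsv = sqlContext.read()
            .format("com.databricks.spark.csv")
            .option("header", "true")      // treat the first line as the header row
            .option("inferSchema", "true") // infer column types from the data
            .load("/Users/sfelsheim/data/address.csv");
    addressCsv.write().parquet("hdfs://localhost:9000/datalake/sample/address.parquet");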

I changed the code to create the DataFrames like this (people.csv is handled the same way):

// Build an RDD of Address beans by parsing each line of the CSV
JavaRDD<Address> address = context.textFile("/Users/sfelsheim/data/address.csv").map(
            new Function<String, Address>() {
                public Address call(String line) throws Exception {
                    String[] parts = line.split(",");

                    Address addr = new Address();
                    addr.setAddrId(parts[0]);
                    addr.setCity(parts[1]);
                    addr.setState(parts[2]);
                    addr.setZip(parts[3]);

                    return addr;
                }
            });

then wrote the resulting DataFrames to HDFS in Parquet format, and the join works as expected.
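The RDD-to-Parquet step would look roughly like this (a sketch; it assumes Address is a serializable JavaBean whose getters match the setters above):

    // Sketch: build a DataFrame from the bean RDD via reflection and persist it.
    // Note that the column names now come from the Address bean's properties,
    // not from the CSV header line.
    DataFrame addressDf = sqlContext.createDataFrame(address, Address.class);
    addressDf.write().parquet("hdfs://localhost:9000/datalake/sample/address.parquet");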

I am reading exactly the same CSVs in both cases.

The problem occurs when attempting a simple join of two DataFrames created from two different Parquet files on HDFS.

[main] INFO org.apache.spark.SparkContext - Running Spark version 1.4.1

Using HDFS from Hadoop 2.7.0

Here is an example.

 public void testStrangeness(String[] args) {
    SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("joinIssue");
    JavaSparkContext context = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(context);

    DataFrame people = sqlContext.parquetFile("hdfs://localhost:9000//datalake/sample/people.parquet");
    DataFrame address = sqlContext.parquetFile("hdfs://localhost:9000//datalake/sample/address.parquet");

    people.printSchema();
    address.printSchema();

    // yeah, works
    DataFrame cartJoin = address.join(people);
    cartJoin.printSchema();

    // boo, fails 
    DataFrame joined = address.join(people,
            address.col("addrid").equalTo(people.col("addressid")));

    joined.printSchema();
}

Contents of people

first,last,addressid 
your,mom,1 
fred,flintstone,2

Contents of address

addrid,city,state,zip
1,sometown,wi,4444
2,bedrock,il,1111

people.printSchema();

results in...

root
 |-- first: string (nullable = true)
 |-- last: string (nullable = true)
 |-- addressid: integer (nullable = true)

address.printSchema();

results in...

root
 |-- addrid: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip: integer (nullable = true)


DataFrame cartJoin = address.join(people);
cartJoin.printSchema();

The Cartesian join works fine, and printSchema() results in...

root
 |-- addrid: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip: integer (nullable = true)
 |-- first: string (nullable = true)
 |-- last: string (nullable = true)
 |-- addressid: integer (nullable = true)

This join...

DataFrame joined = address.join(people,
address.col("addrid").equalTo(people.col("addressid")));

results in the following exception.

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "addrid" among (addrid, city, state, zip);
    at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
    at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
    at org.apache.spark.sql.DataFrame.col(DataFrame.scala:558)
    at dw.dataflow.DataflowParser.testStrangeness(DataflowParser.java:36)
    at dw.dataflow.DataflowParser.main(DataflowParser.java:119)

I tried changing it so that people and address share a common key attribute (addressid) and using...

address.join(people, "addressid");

but got the same result.

Any ideas??

Thanks

1 Answer:

Answer 0 (score: 0):

I was able to solve this problem using Notepad++. Under the "Encoding" menu, I switched from "Encode in UTF-8-BOM" to "Encode in UTF-8".
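This makes sense in hindsight: a UTF-8 BOM at the start of the CSV is read as part of the first header field, so the DataFrame built by the CSV reader (and the Parquet file written from it) ends up with a first column named "\uFEFFaddrid" rather than "addrid". The BOM character is invisible in printSchema() output, which is why the schemas above look correct even though col("addrid") cannot be resolved. One way to confirm this (a sketch, not from the original post) is to dump the code points of each column name:

    // Sketch: make an invisible BOM in a column name visible by printing code points.
    for (String name : address.columns()) {
        StringBuilder codes = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            codes.append(String.format("U+%04X ", (int) name.charAt(i)));
        }
        System.out.println("'" + name + "' -> " + codes.toString().trim());
    }
    // A BOM-prefixed header prints as "U+FEFF U+0061 U+0064 ..." instead of
    // starting at U+0061 ('a'), revealing the hidden character.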