Question

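Below are my schema and the code I use to read from HDFS partitions.

An example partition path is /home/maria_dev/data/key=key/date=19 jan (and, of course, that folder contains a CSV file holding cnt).

So my data is partitioned by the key and date columns.

When I read it as shown below, the columns are not read correctly: cnt is read into date, and vice versa.

How can I fix this?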

private val tweetSchema = new StructType(Array(
    StructField("date", StringType, nullable = true),
    StructField("key", StringType, nullable = true),
    StructField("cnt", IntegerType, nullable = true)
  ))

// basePath example: /home/maria_dev/data
// path example: /home/maria_dev/data/key=key/date=19 jan
private def loadDF(basePath: String, path: String, format: String): DataFrame = {
    val df = spark.read
      .schema(tweetSchema)
      .format(format)
      .option("basePath", basePath)
      .load(path)
    df
}

I tried changing the field order in the schema from (date, key, cnt) to (cnt, key, date), but that did not help.

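My problem is that when I call union, it appends two DataFrames:

df1: {(key: 1, date: 2)}
df2: {(date: 3, key: 4)}

into a final DataFrame that looks like this: {(key: 1, date: 2), (date: 3, key: 4)}. As you can see, the columns are mixed up.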

Answer 1

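The schema should list the columns in this order:

The columns stored in the data file itself; for CSV, in their natural left-to-right order.
The partition columns, in the same order as the directory structure defines them.

So in your case, the correct order would be: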

new StructType(Array(
  StructField("cnt", IntegerType, nullable = true),
  StructField("key", StringType, nullable = true),
  StructField("date", StringType, nullable = true)
))
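
As a rough illustration of that ordering, the sketch below writes a tiny CSV dataset partitioned by key and date, then reads it back with the data column first and the partition columns after it. The local session, the temporary directory (standing in for /home/maria_dev/data), and the sample values are all assumptions made for the example; they are not from the question.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical local session, for illustration only.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("partition-schema-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical temp directory standing in for the real base path.
val basePath = Files.createTempDirectory("partition-sketch").toString

// Produces <basePath>/key=k1/date=19jan/part-*.csv, where the CSV holds only cnt.
Seq((5, "k1", "19jan")).toDF("cnt", "key", "date")
  .write.mode("overwrite").partitionBy("key", "date").csv(basePath)

// Data-file column first, then partition columns in directory order.
val orderedSchema = new StructType(Array(
  StructField("cnt", IntegerType, nullable = true),
  StructField("key", StringType, nullable = true),
  StructField("date", StringType, nullable = true)
))

val df = spark.read
  .schema(orderedSchema)
  .format("csv")
  .option("basePath", basePath)
  .load(basePath)

df.printSchema()
df.show()
```
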

Answer 2

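It turns out that everything was being read correctly.

So now, instead of df1.union(df2), I do df1.select("key", "date").union(df2.select("key", "date")), and it works, because union matches columns by position rather than by name.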
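
A minimal sketch of the difference, using toy DataFrames shaped like df1 and df2 from the question. The local session and the string values are assumptions for illustration. unionByName (available from Spark 2.3) is not mentioned in the thread; it is shown only as an alternative to the select-then-union approach.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session, for illustration only.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("union-order-sketch")
  .getOrCreate()
import spark.implicits._

// Same column names, different column order, as in the question.
val df1 = Seq(("1", "2")).toDF("key", "date")
val df2 = Seq(("3", "4")).toDF("date", "key")

// union pairs columns by position, so df2's date value ends up under "key".
val byPosition = df1.union(df2)

// Aligning the column order with select before the union, as in this answer.
val bySelect = df1.select("key", "date").union(df2.select("key", "date"))

// unionByName pairs columns by name and needs no explicit reordering.
val byName = df1.unionByName(df2)

byPosition.show()
bySelect.show()
byName.show()
```

Whether to use select or unionByName is largely a matter of taste; select also lets you drop columns that only one side has.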

Why can't Spark load the columns correctly from HDFS?
