Reading HDFS part files into a DataFrame using PySpark

Time: 2020-05-30 16:24:00

Tags: pyspark apache-spark-sql hdfs partitioning

I have multiple files stored in an HDFS location, as shown below:

/user/project/202005/part-01798

/user/project/202005/part-01799

There are 2000 such part files. Each file has the following format:


{'Name':'abc','Age':28,'Marks':[20,25,30]}
{'Name':...}

and so on. I have two questions.

1 Answer:

Answer 0: (Score: 1)

  1. Since these files live in a single directory and are named part-xxxxx, it is safe to assume they are multiple part files of the same dataset. If they were partitions, the path would instead look like /user/project/date=202005/* (see the partition-discovery sketch at the end of this answer).
  2. You can specify the directory "/user/project/202005" as the input to Spark as shown below, assuming these are CSV files:
df = spark.read.csv('/user/project/202005/*',header=True, inferSchema=True)
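
The sample records in the question look like JSON lines rather than CSV, so a JSON read may fit the data better. A minimal sketch, assuming each line is one JSON object (Spark's JSON reader accepts single-quoted field names by default, since the allowSingleQuotes option is true):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-part-files").getOrCreate()

# Read every part file in the month directory; pointing at the
# directory itself instead of the glob would also work.
df = spark.read.json('/user/project/202005/part-*')

df.printSchema()   # inferred as Age: long, Marks: array<long>, Name: string
df.show(5)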
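
On the first point: if the month directories had been written as partitions (a hypothetical /user/project/date=202005/ layout, not the paths shown in the question), Spark would discover the partition column from the directory names when reading from the base path. A rough sketch under that assumption:

# Hypothetical partitioned layout: /user/project/date=202005/part-xxxxx
# Reading from the base path makes Spark add a 'date' partition column
# derived from the directory names.
df = spark.read.csv('/user/project/', header=True, inferSchema=True)
df.printSchema()   # schema now includes the 'date' partition column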