Reading HDFS part files into a DataFrame using PySpark

Time: 2020-05-30 16:24:00

Tags: pyspark apache-spark-sql hdfs partitioning

I have multiple files stored in an HDFS location, as shown below:

/user/project/202005/part-01798

/user/project/202005/part-01799

There are 2000 such part files. Each file has the following format:


{'Name':'abc','Age':28,'Marks':[20,25,30]}
{'Name':...}

and so on. I have two questions.

1 Answer:

Answer 0: (Score: 1)

  1. Since these files live in a single directory and are named part-xxxxx, it is safe to assume they are multiple part files of the same dataset. If they were partitions, the path would instead look like /user/project/date=202005/* (see the partition-discovery sketch at the end of this answer).
  2. You can specify the directory "/user/project/202005" as the input to Spark as shown below, assuming these are CSV files:
df = spark.read.csv('/user/project/202005/*',header=True, inferSchema=True)
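
The sample records in the question look like JSON lines rather than CSV, so a JSON read may fit the data better. A minimal sketch, assuming each line is one JSON object (Spark's JSON reader accepts single-quoted field names by default, since the allowSingleQuotes option is true):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-part-files").getOrCreate()

# Read every part file in the month directory; pointing at the
# directory itself instead of the glob would also work.
df = spark.read.json('/user/project/202005/part-*')

df.printSchema()   # inferred as Age: long, Marks: array<long>, Name: string
df.show(5)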
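
On the first point: if the month directories had been written as partitions (a hypothetical /user/project/date=202005/ layout, not the paths shown in the question), Spark would discover the partition column from the directory names when reading from the base path. A rough sketch under that assumption:

# Hypothetical partitioned layout: /user/project/date=202005/part-xxxxx
# Reading from the base path makes Spark add a 'date' partition column
# derived from the directory names.
df = spark.read.csv('/user/project/', header=True, inferSchema=True)
df.printSchema()   # schema now includes the 'date' partition column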