Iterate over a DataFrame and get the index

Time: 2019-09-12 13:24:07

Tags: scala dataframe apache-spark

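I want to iterate over the rows/columns of a DataFrame, get the current row/column index, and perform some other operations. Is there a convenient way to get the index number from a row/column?

The goal is to save the output to an xlsx file with the Apache POI library, so I will probably need to iterate over every cell.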

// iterate over each row/column of the DataFrame
myDataframe.foreach{row =>
  row.toSeq.foreach{col =>
    val rowNum = row.???
    val colNum = col.???
    // further operations on the data...
    // e.g. save the output to an xlsx file with Apache POI
  }
}

I'm working with Spark 1.6.3 and Scala 2.10.5.

1 Answer:

Answer 0: (score: 1)

You can add an index with row_number():

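  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.row_number
  import sqlContext.implicits._ // assumes a spark-shell style `sqlContext`

  val myDataframe = sc.parallelize(List("a", "b", "c", "d")).toDF("value")
  // Prepend a 1-based row number, assigned in order of `value`
  val withIndex = myDataframe.select(row_number().over(Window.orderBy('value)).as("index").cast("INT"), '*)

  // Iterate over withIndex (not myDataframe), so that column 0 holds the row number
  withIndex.foreach { row =>
    for (i <- 0 until row.length) {
      val rowNum = row.getInt(0) // row index taken from the row_number column
      val colNum = i             // column position within the row
    }
  }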
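As an alternative to a window function, the row index can also come from RDD.zipWithIndex, which numbers rows from 0 in their current order. The following is only a minimal sketch: it assumes a standalone local run, and the application name is made up for illustration. In spark-shell, reuse the existing `sc` and `sqlContext` instead of creating new ones.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumption: a standalone local context (hypothetical app name).
val conf = new SparkConf().setAppName("zip-with-index-sketch").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val myDataframe = sc.parallelize(List("a", "b", "c", "d")).toDF("value")

// zipWithIndex pairs each Row with a 0-based Long index that follows the
// RDD's partition order; no sort or window is involved.
val cells = myDataframe.rdd.zipWithIndex().flatMap { case (row, rowNum) =>
  // Row.toSeq exposes the column values; a second zipWithIndex adds the column position.
  row.toSeq.zipWithIndex.map { case (value, colNum) => (rowNum, colNum, value) }
}.collect()

cells.foreach { case (rowNum, colNum, value) =>
  println(s"row:$rowNum, col:$colNum, value:$value")
}

sc.stop()
```

The indices reflect whatever order the DataFrame has at that point, so sort it first if the numbering has to follow a particular column.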

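However, if you want to save the DataFrame as an Excel file, you should collect the data to the driver and then convert it to an array of arrays (a 2D array).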

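  import org.apache.spark.sql.functions.concat_ws

  // Join every column of each row into one comma-separated string, collect to the driver,
  // then split back into cells. Values that themselves contain commas would be split too.
  val list: Array[Array[String]] = withIndex
    .select(concat_ws(",", withIndex.columns.map(withIndex(_)): _*))
    .map(s => s.getString(0))
    .collect()
    .map(s => s.split(","))

  // Print each cell together with its 0-based row and column position
  for (elem <- 0 until list.length) {
    for (elem2 <- 0 until list(elem).length) {
      println(list(elem)(elem2),", row:"+elem+", col:"+elem2)
    }
  }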

(1,, row:0, col:0)
(a,, row:0, col:1)
(2,, row:1, col:0)
(b,, row:1, col:1)
(3,, row:2, col:0)
(c,, row:2, col:1)
(4,, row:3, col:0)
(d,, row:3, col:1)
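Once the data is a plain 2D array on the driver, the row and column indices can also be obtained with the standard collection method zipWithIndex instead of manual counters. A minimal sketch in plain Scala (no Spark needed); the array literal is just a stand-in for the collected data and mirrors the values printed above:

```scala
// Stand-in for the collected DataFrame contents.
val list: Array[Array[String]] = Array(
  Array("1", "a"),
  Array("2", "b"),
  Array("3", "c"),
  Array("4", "d")
)

// Pair every cell with its 0-based (row, column) position.
val cells: Array[(Int, Int, String)] =
  list.zipWithIndex.flatMap { case (row, rowNum) =>
    row.zipWithIndex.map { case (value, colNum) => (rowNum, colNum, value) }
  }

cells.foreach { case (rowNum, colNum, value) =>
  println(s"$value, row:$rowNum, col:$colNum")
}
```

Having the positions as data, rather than as loop variables, makes it straightforward to hand each (row, column, value) triple to whatever writes the spreadsheet.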

I don't know how Apache POI works in Scala, but in Java it should look something like this:

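            import java.io.File;
            import java.io.FileInputStream;
            import java.io.FileOutputStream;
            import org.apache.poi.ss.usermodel.*;

            // excelFilePath must point to an existing workbook; it is not defined in this snippet
            FileInputStream inputStream = new FileInputStream(new File(excelFilePath));
            Workbook workbook = WorkbookFactory.create(inputStream);
            Sheet newSheet = workbook.createSheet("spark");

            // your data from DataFrame
            Object[][] bookComments = {
                    {"1", "a"},
                    {"2", "b"},
                    {"3", "c"},
                    {"4", "d"},
            };

            int rowCount = 0;

            for (Object[] aBook : bookComments) {
                // post-increment, so the first data row lands in sheet row 0
                Row row = newSheet.createRow(rowCount++);

                int columnCount = 0;

                for (Object field : aBook) {
                    // post-increment, so the first value lands in column 0
                    Cell cell = row.createCell(columnCount++);
                    if (field instanceof String) {
                        cell.setCellValue((String) field);
                    } else if (field instanceof Integer) {
                        cell.setCellValue((Integer) field);
                    }
                }

            }

            // write the modified workbook to a new file, then release all resources
            FileOutputStream outputStream = new FileOutputStream("JavaBooks.xlsx");
            workbook.write(outputStream);
            workbook.close();
            outputStream.close();
            inputStream.close();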
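Since POI is a plain Java library, it can be called from Scala directly. The following is a rough Scala sketch of the same idea, not a tested recipe for Spark 1.6.3: it assumes the poi-ooxml artifact is on the classpath, creates a fresh workbook instead of opening an existing one, and uses a made-up output file name (`spark-output.xlsx`) and a stand-in `data` array in place of the collected DataFrame.

```scala
import java.io.FileOutputStream
import org.apache.poi.xssf.usermodel.XSSFWorkbook

// Stand-in for the collected DataFrame contents (hypothetical values).
val data: Array[Array[String]] = Array(
  Array("1", "a"),
  Array("2", "b"),
  Array("3", "c"),
  Array("4", "d")
)

val workbook = new XSSFWorkbook()
try {
  val sheet = workbook.createSheet("spark")

  // zipWithIndex supplies the 0-based row and column numbers POI expects.
  for ((rowValues, rowNum) <- data.zipWithIndex) {
    val row = sheet.createRow(rowNum)
    for ((value, colNum) <- rowValues.zipWithIndex) {
      row.createCell(colNum).setCellValue(value)
    }
  }

  // Hypothetical output path; adjust as needed.
  val out = new FileOutputStream("spark-output.xlsx")
  try workbook.write(out) finally out.close()
} finally {
  workbook.close()
}
```

Every cell is written as text here because the collected array holds strings; numeric columns would need a conversion before calling setCellValue.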