Question

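I want to iterate over the rows/columns of a DataFrame, get the current row/column index, and perform some further operations. Is there a convenient way to get the index number from the row/column?

The goal is to save the output to an xlsx file with the Apache POI library, so iterating over every cell may be necessary.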

// proceed through each row/column of the DataFrame
myDataframe.foreach{row =>
  row.toSeq.foreach{col =>
    val rowNum = row.???
    val colNum = col.???
    // further operations on the data...
    // like save the output to the xlsx file with the Apache POI
  }
}

I am working with Spark 1.6.3 and Scala 2.10.5.

Answer 1

You can add an index with row_number():

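  // In Spark 1.6, window functions such as row_number() require a HiveContext.
  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.row_number

  val myDataframe = sc.parallelize(List("a", "b", "c", "d")).toDF("value")
  val withIndex = myDataframe.select(row_number().over(Window.orderBy('value)).cast("int").as("index"), '*)

  // iterate over withIndex (not myDataframe) so that column 0 holds the index
  withIndex.foreach { row =>
    for (i <- 0 until row.length) {
      val rowNum = row.getInt(0) // 1-based index produced by row_number()
      val colNum = i             // 0-based column position
    }
  }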
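As an alternative that is not part of the answer above: if a zero-based index in the RDD's current order is enough, `zipWithIndex` on the underlying RDD avoids the window function. The following is only a minimal sketch. The object name `ZipWithIndexSketch`, the helper `indexedCells`, and the local master setting are assumptions made for illustration, and the index reflects whatever order the rows happen to have, so sort first if a specific order matters.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}

object ZipWithIndexSketch {
  // Pairs every cell with its zero-based (row, column) position.
  // indexedCells is a hypothetical helper, not a Spark API.
  def indexedCells(df: DataFrame): RDD[(Long, Int, Any)] =
    df.rdd.zipWithIndex().flatMap { case (row, rowNum) =>
      row.toSeq.zipWithIndex.map { case (value, colNum) => (rowNum, colNum, value) }
    }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch can run on one machine.
    val conf = new SparkConf().setAppName("zip-with-index-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val myDataframe = sc.parallelize(List("a", "b", "c", "d")).toDF("value")

    // Collect to the driver before printing so the output appears in one place.
    indexedCells(myDataframe).collect().foreach { case (rowNum, colNum, value) =>
      println(s"$value, row:$rowNum, col:$colNum")
    }

    sc.stop()
  }
}
```

Here the row index is a `Long` because `zipWithIndex` on an RDD returns `(element, Long)` pairs, while the column index comes from the ordinary collection `zipWithIndex` on `row.toSeq`.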

However, if you want to save the DataFrame as an Excel file, you should collect the data and then convert it to an array of arrays (a 2D array).

  // requires: import org.apache.spark.sql.functions.concat_ws
  val list: Array[Array[String]] = withIndex
    .select(concat_ws(",", withIndex.columns.map(withIndex(_)): _*))
    .map(row => row.getString(0))
    .collect()
    .map(_.split(","))

  for (rowNum <- 0 until list.length) {
    for (colNum <- 0 until list(rowNum).length) {
      // printing a tuple gives the "(value,, row:r, col:c)" format
      println((list(rowNum)(colNum), ", row:" + rowNum + ", col:" + colNum))
    }
  }

(1,, row:0, col:0)
(a,, row:0, col:1)
(2,, row:1, col:0)
(b,, row:1, col:1)
(3,, row:2, col:0)
(c,, row:2, col:1)
(4,, row:3, col:0)
(d,, row:3, col:1)
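One caveat with joining the columns by "," and splitting again: a value that itself contains a comma would be split into extra columns. A minimal sketch of a conversion that skips the string round trip is shown here. `GridSketch` and `toGrid` are hypothetical names, and mapping null to an empty string is an assumption, not something the answer specifies.

```scala
object GridSketch {
  // Turns collected rows (each given as a Seq of cell values) into a
  // 2D array of strings. Null cells become empty strings (an assumption).
  def toGrid(rows: Seq[Seq[Any]]): Array[Array[String]] =
    rows.map(_.map(v => if (v == null) "" else v.toString).toArray).toArray

  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for the collected DataFrame rows.
    val rows: Seq[Seq[Any]] = Seq(Seq(1, "a"), Seq(2, "b,c"))
    val grid = toGrid(rows)
    println(grid(1)(1)) // prints "b,c": the embedded comma stays in one cell
  }
}
```

With Spark, the input could presumably be produced with something like `withIndex.collect().map(_.toSeq)`, since `Row.toSeq` returns the cell values of a row.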

I don't know how Apache POI works in Scala, but in Java it should look something like this:

    // requires: import java.io.*; import org.apache.poi.ss.usermodel.*;
    // excelFilePath is not defined in this snippet; it must point to an existing workbook
    FileInputStream inputStream = new FileInputStream(new File(excelFilePath));
    Workbook workbook = WorkbookFactory.create(inputStream);
    Sheet newSheet = workbook.createSheet("spark");

    // your data from the DataFrame
    Object[][] bookComments = {
            {"1", "a"},
            {"2", "b"},
            {"3", "c"},
            {"4", "d"},
    };

    // POI row and cell indices are zero-based, so start at 0
    int rowCount = 0;

    for (Object[] aBook : bookComments) {
        Row row = newSheet.createRow(rowCount++);

        int columnCount = 0;

        for (Object field : aBook) {
            Cell cell = row.createCell(columnCount++);
            if (field instanceof String) {
                cell.setCellValue((String) field);
            } else if (field instanceof Integer) {
                cell.setCellValue((Integer) field);
            }
        }
    }

    FileOutputStream outputStream = new FileOutputStream("JavaBooks.xlsx");
    workbook.write(outputStream);
    workbook.close();
    outputStream.close();
    inputStream.close();
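Since the question uses Scala, the same idea can be sketched directly in Scala, because POI is a plain Java library that can be called from Scala as-is. This is only a rough sketch under a few assumptions: the object name `WriteXlsxSketch`, the `write` helper, and the output file name are made up for illustration, the data is a small 2D array already collected on the driver, and a new workbook is created instead of opening an existing one.

```scala
import java.io.FileOutputStream
import org.apache.poi.xssf.usermodel.XSSFWorkbook

object WriteXlsxSketch {
  // Writes a 2D array of strings to a sheet named "spark".
  // Row and cell indices passed to POI are zero-based.
  def write(grid: Array[Array[String]], path: String): Unit = {
    val workbook = new XSSFWorkbook()
    val sheet = workbook.createSheet("spark")

    for ((cells, rowNum) <- grid.zipWithIndex) {
      val row = sheet.createRow(rowNum)
      for ((value, colNum) <- cells.zipWithIndex) {
        row.createCell(colNum).setCellValue(value)
      }
    }

    val out = new FileOutputStream(path)
    try {
      workbook.write(out)
    } finally {
      out.close()
      workbook.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical data standing in for the collected DataFrame.
    val grid = Array(Array("1", "a"), Array("2", "b"), Array("3", "c"), Array("4", "d"))
    write(grid, "SparkOutput.xlsx")
  }
}
```

Using `zipWithIndex` on the collected array gives the row and column numbers without manual counters, which is the part the question asked about.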
