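I want to iterate over the rows/columns of a DataFrame, get the current row/column index, and perform some other operations. Is there a convenient way to get the index number from a row/column?
The goal is to save the output to an xlsx file with the Apache POI library, so I probably need to iterate over every cell.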
// iterate through each row/column of the DataFrame
myDataframe.foreach { row =>
  row.toSeq.foreach { col =>
    val rowNum = row.???
    val colNum = col.???
    // further operations on the data...
    // e.g. save the output to the xlsx file with Apache POI
  }
}
I am working with Spark 1.6.3 and Scala 2.10.5.
Answer 0 (score: 1)
You can add an index with row_number():
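import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import sqlContext.implicits._ // assumes a `sqlContext` in scope (as in spark-shell); enables toDF and 'symbol columns

val myDataframe = sc.parallelize(List("a", "b", "c", "d")).toDF("value")
// row_number() is 1-based; on Spark 1.x, window functions require a HiveContext
val withIndex = myDataframe.select(row_number().over(Window.orderBy('value)).cast("int").as("index"), '*)
// iterate over withIndex (not myDataframe) so that column 0 holds the index
withIndex.foreach { row =>
  for (i <- 0 until row.length) {
    val rowNum = row.getInt(0)
    val colNum = i
  }
}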
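However, if you want to save the DataFrame to an Excel file, you should collect the data to the driver and then convert it to an array of arrays (a 2D array).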
import org.apache.spark.sql.functions.concat_ws

// Note: joining and splitting on "," breaks if a value itself contains a comma
val list: Array[Array[String]] = withIndex
  .select(concat_ws(",", withIndex.columns.map(withIndex(_)): _*))
  .map(s => s.getString(0))
  .collect()
  .map(s => s.split(","))
for (elem <- 0 until list.length) {
  for (elem2 <- 0 until list(elem).length) {
    // prints a (value, position) tuple
    println((list(elem)(elem2), ", row:" + elem + ", col:" + elem2))
  }
}
(1,, row:0, col:0)
(a,, row:0, col:1)
(2,, row:1, col:0)
(b,, row:1, col:1)
(3,, row:2, col:0)
(c,, row:2, col:1)
(4,, row:3, col:0)
(d,, row:3, col:1)
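The comma-joined approach above turns every value into a string and depends on no value containing a comma. As a hedged alternative, the sketch below pairs each cell with its row and column index using plain `zipWithIndex` on rows that have already been collected. `IndexedCells` and `indexedCells` are hypothetical names introduced here for illustration, not part of Spark. It assumes the data fits in driver memory.

```scala
// Minimal sketch: attach (rowNum, colNum) to every cell of already-collected rows.
// `indexedCells` is a hypothetical helper, not a Spark API.
object IndexedCells {
  def indexedCells(rows: Seq[Seq[Any]]): Seq[(Int, Int, Any)] =
    for {
      (row, rowNum)   <- rows.zipWithIndex // 0-based row index
      (value, colNum) <- row.zipWithIndex  // 0-based column index
    } yield (rowNum, colNum, value)

  def main(args: Array[String]): Unit = {
    // With Spark, the input could come from: withIndex.collect().map(_.toSeq).toSeq
    val rows: Seq[Seq[Any]] = Seq(Seq(1, "a"), Seq(2, "b"))
    indexedCells(rows).foreach { case (rowNum, colNum, value) =>
      println("row:" + rowNum + ", col:" + colNum + ", value:" + value)
    }
  }
}
```

Because the values keep their original types, the later Excel step can decide per cell whether to write a number or a string.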
I don't know how Apache POI works in Scala, but in Java it should look something like this:
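import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

// excelFilePath must point to an existing workbook
FileInputStream inputStream = new FileInputStream(new File(excelFilePath));
Workbook workbook = WorkbookFactory.create(inputStream);
Sheet newSheet = workbook.createSheet("spark");
// your data from DataFrame
Object[][] bookComments = {
    {"1", "a"},
    {"2", "b"},
    {"3", "c"},
    {"4", "d"},
};
int rowCount = 0;
for (Object[] aBook : bookComments) {
    // post-increment so that writing starts at row 0 (POI indices are 0-based)
    Row row = newSheet.createRow(rowCount++);
    int columnCount = 0;
    for (Object field : aBook) {
        // post-increment so that writing starts at column 0
        Cell cell = row.createCell(columnCount++);
        if (field instanceof String) {
            cell.setCellValue((String) field);
        } else if (field instanceof Integer) {
            cell.setCellValue((Integer) field);
        }
    }
}
inputStream.close();
FileOutputStream outputStream = new FileOutputStream("JavaBooks.xlsx");
workbook.write(outputStream);
workbook.close();
outputStream.close();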
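Since the question is about Scala, here is a rough Scala equivalent of the Java code above. It is only a sketch: it assumes the Apache POI `poi-ooxml` artifact is on the classpath, creates a fresh workbook instead of opening an existing one, and expects rows that were already collected to the driver. `WriteXlsx` and `writeXlsx` are hypothetical names used for illustration.

```scala
import java.io.FileOutputStream
import org.apache.poi.xssf.usermodel.XSSFWorkbook

// Sketch only: write collected rows to a new .xlsx file, one cell per value.
// `writeXlsx` is a hypothetical helper; the sheet name "spark" follows the Java example.
object WriteXlsx {
  def writeXlsx(rows: Seq[Seq[Any]], path: String): Unit = {
    val workbook = new XSSFWorkbook()
    val sheet = workbook.createSheet("spark")
    for ((values, rowNum) <- rows.zipWithIndex) {
      val row = sheet.createRow(rowNum) // 0-based row index
      for ((value, colNum) <- values.zipWithIndex) {
        val cell = row.createCell(colNum) // 0-based column index
        value match {
          case null                => () // leave the cell blank
          case n: java.lang.Number => cell.setCellValue(n.doubleValue())
          case other               => cell.setCellValue(other.toString)
        }
      }
    }
    val out = new FileOutputStream(path)
    try {
      workbook.write(out)
    } finally {
      out.close()
      workbook.close()
    }
  }
}
```

With Spark, a call might look like `WriteXlsx.writeXlsx(withIndex.collect().map(_.toSeq).toSeq, "output.xlsx")`, where the output path is just an example. Numeric values are written as doubles because that is the numeric type POI cells accept; everything else falls back to its string form.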