Merging every 3 lines of an RDD into one line in Spark

Date: 2018-08-03 07:40:15

Tags: scala apache-spark rdd

I have a data set in which each line looks like this:

A
B
C
QW
OO
P
...

Now I want to merge every three lines into one, like this:

ABC
QWOOP
...

How should this function be written?

e.g. val data = sc.textFile("path")

Thanks!

1 answer:

Answer 0: (score: -1)

val lineRdd = sc.textFile("path")

val yourRequiredRdd = lineRdd
  .zipWithIndex                                   // pair each line with its global index
  .map({ case (line, index) => (index / 3, (index, line)) })  // key = which group of 3
  .aggregateByKey(List.empty[(Long, String)])(
    { case (aggrList, (index, line)) => (index, line) :: aggrList },  // collect within a partition
    { case (aggrList1, aggrList2) => aggrList1 ++ aggrList2 }         // merge across partitions
  )
  .map({ case (key, aggrList) =>
    aggrList
      .sortBy({ case (index, line) => index })    // restore original line order within the group
      .map({ case (index, line) => line })
      .mkString("")
  })
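The same index / 3 grouping logic can be sanity-checked locally with plain Scala collections, without a Spark cluster. This is only a sketch of the grouping idea, using the sample lines from the question:

```scala
// Local check of the grouping logic: zip with index, group by index / 3,
// then concatenate each group in order. No Spark required.
val lines = List("A", "B", "C", "QW", "OO", "P")

val merged = lines
  .zipWithIndex
  .groupBy { case (_, index) => index / 3 }   // same key as index / 3 above
  .toList
  .sortBy { case (key, _) => key }            // restore group order
  .map { case (_, group) => group.map(_._1).mkString("") }

println(merged)  // List(ABC, QWOOP)
```

Note that unlike this local version, the RDD result's partition order is not guaranteed; if the merged lines must come back in file order, sort by the group key (e.g. with sortByKey) before collecting.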