Split an RDD and select elements

Time: 2017-11-17 06:07:27

Tags: scala apache-spark

I am trying to capture a stream, transform the data, and then save it locally.

So far, the streaming and the writing work fine; it is the transformation that only gets me halfway. The stream I receive contains 9 columns separated by "|", so I want to split each line and select, say, columns 1, 3 and 5. What I tried looks like this, but it doesn't really produce the result:

val indices = List(1, 3, 5)

linesFilter
  .window(Seconds(EVENT_PERIOD_SECONDS * WRITE_EVERY_N_SECONDS), Seconds(EVENT_PERIOD_SECONDS * WRITE_EVERY_N_SECONDS))
  .foreachRDD { (rdd, time) =>
    if (rdd.count() > 0) {
      rdd
        .map(_.split("\\|").slice(1, 2))
        //.map(arr => (arr(0), arr(2)))
        //.filter(x => indices.contains(_(x)))   //select(indices)
        //.zipWithIndex
        .coalesce(1, true)
        //the replacement is used so that I get a csv file at the end
        //.map(_.replace(DELIMITER_STREAM, DELIMITER_OUTPUT))
        //.map(_.mkString(DELIMITER_OUTPUT))
        .saveAsTextFile(CHECKPOINT_DIR + "/output/o_" + sdf.format(System.currentTimeMillis()))
    }
  }

Does anyone have a hint on how to split an RDD and then grab only specific elements from it?

Edit: the input

val lines = streamingContext.socketTextStream(HOST, PORT)
val linesFilter = lines
          .map(_.toLowerCase)
          .filter(_.split(DELIMITER_STREAM).length == 9)

The input stream is:

536365|71053|white metal lantern|6|01-12-10 8:26|3,39|17850|united kingdom|2017-11-17 14:52:22
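
For context, here is a minimal sketch of the kind of column selection being asked about, assuming Spark's RDD API and an illustrative output delimiter; the helper name selectColumns and the parameter outDelimiter are not from the original code:

import org.apache.spark.rdd.RDD

// Hypothetical helper: from an RDD of "|"-separated lines, keep only the
// (0-based) columns listed in `indices` and re-join them with `outDelimiter`.
def selectColumns(rdd: RDD[String], indices: List[Int], outDelimiter: String): RDD[String] =
  rdd
    .map(_.split("\\|"))                 // split each line into its columns
    .map(cols => indices.map(cols(_)))   // pick only the requested columns
    .map(_.mkString(outDelimiter))       // re-join them for the CSV output

// e.g. selectColumns(rdd, List(1, 3, 5), ",") keeps columns 1, 3 and 5 of each line.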

1 answer:

Answer 0 (score: 0)

Thank you all very much.

As you suggested, I modified my code:

private val DELIMITER_STREAM = "\\|"

val linesFilter = lines
  .map(_.toLowerCase)
  .filter(_.split(DELIMITER_STREAM).length == 9)
  .map(x => {
    val y = x.split(DELIMITER_STREAM)
    (y(0), y(1), y(3), y(4), y(5), y(6), y(7))
  })

and then, on the RDD:

if (rdd.count() > 0) {

  rdd
    .map(_.productIterator.mkString(DELIMITER_OUTPUT))
    .coalesce(1,true)
    .saveAsTextFile(CHECKPOINT_DIR + "/output/o_" + sdf.format(System.currentTimeMillis()))

  }
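
For completeness, the same selection can also be written with an index list instead of a fixed-arity tuple; this splits each line only once and keeps working if the set of wanted columns changes later. This is only a sketch under the same assumptions (9 pipe-separated columns, DELIMITER_OUTPUT as the output separator), and `wanted` is an illustrative name:

val wanted = List(0, 1, 3, 4, 5, 6, 7)   // the columns kept in the tuple version above

val linesFilter = lines
  .map(_.toLowerCase)
  .map(_.split(DELIMITER_STREAM))   // split once per record
  .filter(_.length == 9)            // keep only complete rows
  .map(cols => wanted.map(cols(_)).mkString(DELIMITER_OUTPUT))

// The foreachRDD body then only needs coalesce(1, true) and saveAsTextFile,
// since each element is already a delimited string.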