Question

I am new to Spark. I want to set up Spark Streaming to retrieve key-value pairs from files in the following format:

file: info1

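Note: each info file will contain about 1000 of these records, and our system generates these info files continuously. With Spark Streaming I want to map line numbers to info files and get an aggregated result.

Can we feed this kind of file to a Spark cluster as input? I am only interested in the "SF" and "DA" delimiters: "SF" corresponds to the source file and "DA" corresponds to (line number, count).

Since this input data is not in a line-oriented format, is it a good idea to use these files as Spark input directly, or do I need an intermediate stage that cleans them up and produces new files with each record's information on one line instead of in a block?

Or can we achieve this within Spark itself? What is the right approach?

What do I want to achieve? I want line-level information, that is, the line (as key) and the info files (as values).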

The final output I want looks like this:

line178 -> (info1, info2, info7 .................)

line2908 -> (info3, info90, ..., ...,)

Please let me know if my explanation is unclear or if I have missed something.

Thanks & Regards, Vinti

Answer 1

You can do something like this. Given a DStream named stream:

// this gives you DA & FP lines, with the line number as the key
val validLines = stream.map(_.split(":")).
  filter(line => Seq("DA", "FP").contains(line(0))).
  map(_(1).split(",")).
  map(line => (line(0), line(1)))

// appends the values of the current batch to those accumulated so far
def updateFunction(newValues: Seq[String], runningValues: Option[Seq[String]]): Option[Seq[String]] = {
  val newVals = runningValues match {
    case Some(list) => list ++ newValues
    case _ => newValues
  }
  Some(newVals)
}

// now you should accumulate values
// (updateStateByKey needs checkpointing enabled on the StreamingContext)
val state = validLines.updateStateByKey[Seq[String]](updateFunction _)

This should accumulate, for each key, a sequence of the associated values and keep it in the state.
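To see what the accumulation does without a cluster, here is a minimal Spark-free sketch in which plain Scala collections stand in for the micro-batches. It is an illustration under assumptions, not the code above: the helper names (LineToInfoFiles, parse, step), the sample file contents and the "DA:<line>,<count>" record layout are all hypothetical, and the value stored per line is the info file name (what the question asks for) rather than the count.

```scala
// Spark-free sketch: plain Scala collections stand in for DStream batches.
object LineToInfoFiles {

  // Same shape as the function passed to updateStateByKey:
  // append this batch's values to the state accumulated so far.
  def updateFunction(newValues: Seq[String],
                     runningValues: Option[Seq[String]]): Option[Seq[String]] =
    Some(runningValues.getOrElse(Seq.empty) ++ newValues)

  // Turn the text of one info file into (lineNumber, infoFileName) pairs.
  // Assumes records look like "DA:<line>,<count>"; all other lines are skipped.
  def parse(infoName: String, content: String): Seq[(String, String)] =
    content.split("\n").toSeq
      .map(_.trim.split(":", 2))
      .collect { case Array("DA", rest) => (rest.split(",")(0), infoName) }

  // Apply one batch of pairs to the state, key by key,
  // calling updateFunction for every key already in the state as well.
  def step(state: Map[String, Seq[String]],
           batch: Seq[(String, String)]): Map[String, Seq[String]] = {
    val grouped = batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    (state.keySet ++ grouped.keySet).flatMap { k =>
      updateFunction(grouped.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data, one element per micro-batch.
    val batches = Seq(
      "info1" -> "SF:/src/a.c\nDA:178,3\nDA:2908,1",
      "info2" -> "SF:/src/a.c\nDA:178,7"
    )
    val finalState = batches.foldLeft(Map.empty[String, Seq[String]]) {
      case (state, (name, content)) => step(state, parse(name, content))
    }
    finalState.toSeq.sortBy(_._1).foreach { case (line, infos) =>
      println(s"line$line -> (${infos.mkString(", ")})")
    }
  }
}
```

With this sample data, line 178 ends up mapped to info1 and info2, and line 2908 to info1 only. In a real job the info file name would have to travel with each record, which this sketch simply takes as a parameter.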

Spark Streaming Designing Question
