Question

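I want to bucket data in Spark so that timestamps that differ by 5 seconds or less fall, together with their data, into one 5-second bucket. Likewise, the next 5-second bucket takes the following logs, and so on (so that I can aggregate the data within each bucket). My logs: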

1472120400.107 HTTP GEO er.aujf.csdh.jkhydf.eyrgt
1472120399.999 HTTP GEO er.asdhff.cdn.qyirg.sdgsg
1472120397.633 HTTP GEO er.abff.kagsf.weyfh.ajfg
1472120397.261 HTTP GEO er.laffg.ayhrff.agyfr.yawr
1472120394.328 HTTP GEO er.qfryf.aqwruf.oiuqwr.agsf
1472120393.737 HTTP GEO er.aysf.aouf.ujaf.casf
.
.
.

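I still cannot figure out how to do this in Spark.

Logs with timestamps such as 1472120400.107, 1472120399.999, 1472120397.633, 1472120397.261 go into one bucket, the next ones go into the next bucket, and so on.

Output:

All log lines with timestamps 1472120400.107, 1472120399.999, 1472120397.633, 1472120397.261 are kept in memory (one bucket) for further processing, such as computing a count over the whole bucket. Likewise for the next bucket.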

Answer 1

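Simply divide the timestamps by the granularity you want to create. Save the bin number as the key in the binRDD, where data is the input, and then call reduceByKey.

I will write the code example in Scala; converting it to Python is basically trivial, and I only want to illustrate the point.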

val l5 = List("1472120400.107 HTTP GEO er.aujf.csdh.jkhydf.eyrgt", "1472120399.999 HTTP GEO er.asdhff.cdn.qyirg.sdgsg")
val l5RDD = sc.parallelize(l5) // input as an RDD (sc is the SparkContext)
val l5tmp = l5RDD.map(item => item.split(" ")) // split each log line into its fields
// Key each record by its bin number. The divisor is the bin width in seconds, since the timestamps are epoch seconds; use 5 for 5-second bins.
val l5tmp2 = l5tmp.map(item => ((item(0).toDouble / 3600000).toInt, List(item)))
val collected = l5tmp2.reduceByKey(_ ++ _) // concatenate the lists to build the bins of data
collected.collect().foreach(println) // prints (408,List([Ljava.lang.String;@2c6aed22, [Ljava.lang.String;@e322ec9)): both entries were collected into the bin numbered 408
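
For the 5-second case asked about, the same idea can be sketched with plain Scala collections, so it runs without a cluster. This is only an illustration under stated assumptions: the timestamps are taken to be epoch seconds, and FiveSecondBins, binOf and countPerBin are hypothetical names that are not part of Spark. Note that bins cut on a fixed 5-second grid do not have to match the grouping listed in the question: 1472120400.107 lands one bin after 1472120399.999.

```scala
object FiveSecondBins {
  // Bin width in seconds (assumption: timestamps are epoch seconds with a fractional part).
  val binWidthSeconds = 5L

  // Hypothetical helper: bin number = floor(timestamp / bin width).
  def binOf(line: String): Long =
    math.floor(line.takeWhile(_ != ' ').toDouble / binWidthSeconds).toLong

  // Hypothetical helper: number of log lines per bin.
  def countPerBin(lines: Seq[String]): Map[Long, Int] =
    lines.groupBy(binOf).map { case (bin, members) => bin -> members.size }

  def main(args: Array[String]): Unit = {
    val logs = List(
      "1472120400.107 HTTP GEO er.aujf.csdh.jkhydf.eyrgt",
      "1472120399.999 HTTP GEO er.asdhff.cdn.qyirg.sdgsg",
      "1472120397.633 HTTP GEO er.abff.kagsf.weyfh.ajfg",
      "1472120397.261 HTTP GEO er.laffg.ayhrff.agyfr.yawr",
      "1472120394.328 HTTP GEO er.qfryf.aqwruf.oiuqwr.agsf",
      "1472120393.737 HTTP GEO er.aysf.aouf.ujaf.casf"
    )
    // Print (bin, count) pairs in ascending bin order.
    countPerBin(logs).toSeq.sortBy(_._1).foreach(println)
  }
}
```

On an RDD of log lines, the same key function should carry over as something like rdd.map(line => (binOf(line), 1)).reduceByKey(_ + _) for per-bin counts, which avoids building the full lists when only an aggregate is needed.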

Creating buckets in Apache Spark
