I am new to Spark, and I want to process files with Spark Streaming.
I have CSV files arriving continuously.
Example CSV file:
world world
count world
world earth
count world
I want to apply two processing steps to them:
The first processing step should give this result:
(world,2,2) // "world" appears twice in the first column, and its second-column values (world, earth) are distinct, hence (2,2)
(count,2,1) // "count" appears twice in the first column, and its second-column values (world, world) are not distinct, hence (2,1)
The second result:
I want to get that result after every hour. In our example:
(world,1) // 1 = 2/2
(count,2) // 2 = 2/1
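To make the expected numbers concrete, here is a small non-streaming sketch of the same two computations over the four sample lines, using plain Scala collections (the names pairs, firstResult, secondResult are only illustrative, they are not part of my streaming job):

val lines = Seq("world world", "count world", "world earth", "count world")
// Split each line into a (word1, word2) pair
val pairs = lines.map { l => val p = l.split(" "); (p(0), p(1)) }
// First result: (word1, total occurrences, number of distinct second words)
val firstResult = pairs.groupBy(_._1).map { case (w, ps) =>
  (w, ps.size, ps.map(_._2).distinct.size)
}
// firstResult contains (world,2,2) and (count,2,1)
// Second result: total / distinct
val secondResult = firstResult.map { case (w, total, distinct) =>
  (w, total.toFloat / distinct)
}
// secondResult contains (world,1.0) and (count,2.0)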
Here is my code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("File Count")
  .setMaster("local[2]")
val sc = new SparkContext(conf)
// 10-second batch interval
val ssc = new StreamingContext(sc, Seconds(10))
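// textFileStream watches the directory and processes files that are newly created in it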
val file = ssc.textFileStream("hdfs://192.168.1.31:8020/user/sparkStreaming/input")
// Key each line by "word1;word2" and count occurrences of each pair in the batch
val result = file.map(x => (x.split(" ")(0) + ";" + x.split(" ")(1), 1)).reduceByKey((x, y) => x + y)
// Aggregate the pair counts over a 60-second window, sliding every 20 seconds
val window = result.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(20))
// Render each (pair, count) tuple as a String, e.g. "(world;earth,1)"
val result1 = window.map(x => x.toString)
// Keep the first word and the count, e.g. "(world,1)"
val result2 = result1.map(line => line.split(";")(0) + "," + line.split(",")(1))
// Strip the surrounding parentheses, e.g. "world,1"
val result3 = result2.map(line => line.substring(1, line.length - 1))
// (first word, occurrences of that pair)
val result4 = result3.map(line => (line.split(",")(0), line.split(",")(1).toInt))
// Total occurrences of each first word, e.g. (world,2), (count,2)
val result5 = result4.reduceByKey((x, y) => x + y)
// One record per distinct (word1, word2) pair seen in the window
val result6 = result3.map(line => (line.split(",")(0), 1))
// Number of distinct second words per first word, e.g. (world,2), (count,1)
val result7 = result6.reduceByKey((x, y) => x + y)
// Join totals with distinct counts: (word, (total, distinct)), e.g. (world,(2,2)), (count,(2,1))
val result8 = result5.join(result7)
// total / distinct, e.g. (world,1.0), (count,2.0); I want this result after every one hour
val finalResult = result8.mapValues(x => x._1.toFloat / x._2)
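// Spark Streaming needs at least one output operation before start();
// print() is used here only as a simple example sink
finalResult.print()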
ssc.start()
ssc.awaitTermination()
Thanks in advance!!!