Spark Streaming processes the same data multiple times

Time: 2018-07-12 12:46:00

Tags: python mongodb apache-spark apache-kafka spark-streaming

I have a problem processing a data stream with Spark Streaming and storing it in MongoDB. The scenario is as follows: a publisher sends some data (for example, the angle of a robot's wheels and the distance traveled), and a consumer receives this data through Kafka and processes it with Spark Streaming (computing the coordinates in the XY plane), then stores it in MongoDB.

The problem is this: although each message is consumed only once and the direct stream contains a single RDD, the message is processed 3 times, so I get 3 updates instead of 1. This happens only when I store the data in MongoDB; if I just call pprint() on the processed stream instead, it does not happen. Here is the code:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils 
import math
import time
from pyspark import SparkConf
import pymongo_spark
# Important: activate pymongo_spark.
pymongo_spark.activate()

startInformation = {'robot id':'r','x coordinate':'0','y coordinate':'0','speed':'0','delta space':'s','theta twist':'t','timeStamp':'ts'}
oldX = ''
oldY = ''

# Create a local StreamingContext with two worker threads and a batch interval of 3 seconds
sc = SparkContext("local[2]", "OdometryConsumer")
ssc = StreamingContext(sc, 3)

kafkaStream = KafkaUtils.createDirectStream(ssc, ['odometry'], {'metadata.broker.list': 'localhost:9092'})

def getPositionSpeed(line):
    # Default starting state, used when info.txt is empty or malformed.
    oldX = 0.0
    oldY = 0.0
    oldTs = int(time.time())
    fr = open('/home/erca/Scrivania/proveTesi/info.txt', 'r')
    for l in fr.readlines():
        oldX = float(l.split(' ')[0])
        oldY = float(l.split(' ')[1])
        try:
            oldTs = int(l.split(' ')[2])
        except:
            oldTs = int(time.time())
    fr.close()
    fields = line[1].split(" ")
    robotId = fields[0].split(":")[1]
    deltaSpace = float(fields[1].split(":")[1])
    thetaTwist = float(fields[2].split(":")[1])
    ts = int(fields[3].split(":")[1])
    newX = oldX + deltaSpace*(math.cos(thetaTwist))
    newY = oldY + deltaSpace*(math.sin(thetaTwist))
    print("******************************** old ts: " + str(oldTs))
    print("******************************** new ts: " + str(ts))
    print("******************************** space: " + str(deltaSpace))
    print("******************************** angle: " + str(thetaTwist))

    try:
        speed = (float(deltaSpace))/(float(ts - oldTs))
    except Exception as e: 
        speed = str(e)
        #speed = float(9999999999999)

    fw = open('/home/erca/Scrivania/proveTesi/info.txt', 'w')
    fw.write(str(newX) + " " + str(newY) + " " + str(ts))
    fw.close()

    startInformation['robot id'] = robotId
    startInformation['x coordinate'] = newX
    startInformation['y coordinate'] = newY
    startInformation['speed'] = speed
    startInformation['delta space'] = deltaSpace
    startInformation['theta twist'] = thetaTwist
    startInformation['timeStamp'] = ts

    print("-------------------------------" + str(startInformation) + "-------------------------------")

    return startInformation

elaborate = kafkaStream.map(getPositionSpeed)
#elaborate.pprint()

def sendRecord(rdd):
    try:
        rdd.saveToMongoDB('mongodb://localhost:27017/marco.odometry')
    except Exception as e:
        # Log the failure instead of silently swallowing it.
        print("saveToMongoDB failed: " + str(e))

elaborate.foreachRDD(sendRecord)

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate
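For reference, the position update inside getPositionSpeed is plain dead reckoning and can be checked in isolation from Spark and MongoDB. A minimal sketch (the dead_reckon helper name is mine, not part of the original code):

```python
import math

def dead_reckon(old_x, old_y, delta_space, theta):
    # New position after moving delta_space along heading theta (radians),
    # i.e. the same update as newX/newY in getPositionSpeed.
    return (old_x + delta_space * math.cos(theta),
            old_y + delta_space * math.sin(theta))

# Moving 2 units along heading 0 advances only the x coordinate.
x, y = dead_reckon(0.0, 0.0, 2.0, 0.0)
```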

Can anyone help me? Thanks.

0 Answers:

There are no answers.