I am writing a Spark Streaming application that streams data from S3, does some aggregation, and raises appropriate errors. I am stuck because I keep getting this error:
Traceback (most recent call last):
  File "/home/plivo/Downloads/spark-1.4.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/streaming/util.py", line 59, in call
    return r._jrdd
  File "/home/plivo/Downloads/spark-1.4.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/rdd.py", line 2351, in _jrdd
    pickled_cmd, bvars, env, includes = _prepare_for_python_RDD(self.ctx, command, self)
  File "/home/plivo/Downloads/spark-1.4.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/rdd.py", line 2271, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
  File "/home/plivo/Downloads/spark-1.4.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/serializers.py", line 427, in dumps
    return cloudpickle.dumps(obj, 2)
  File "/home/plivo/Downloads/spark-1.4.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 622, in dumps
    cp.dump(obj)
  File "/home/plivo/Downloads/spark-1.4.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 111, in dump
    raise pickle.PicklingError(msg)
PicklingError: Could not pickle object as excessively deep recursion required.
Here is the code I am trying:
import time
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
if __name__ == '__main__':
    limit = {'111111': 200, '222222': 100, '333333': 100, '444444': 100, '555555': 100, '666666': 100}
    current_value = {str(x) * 6: [int(time.time()) / 60, 0] for x in range(1, 7)}

    def check(x):
        # 'client' is an S3 client (e.g. boto/boto3) created elsewhere in the full script
        response = client.put_object(Key='key', Body='body', Bucket='bucket')
        return True

    sc = SparkContext('local[2]', 's3_streaming')
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "<key>")
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "<key>")

    ssc = StreamingContext(sc, 10)
    rdd = ssc.textFileStream('s3n://sparktest01')

    pairs = rdd.map(lambda x: (x.split(',')[0], int(x.split(',')[3])))
    aggr = pairs.reduceByKey(lambda x, y: int(x) + int(y))
    final = aggr.map(lambda x: (x, check(x)))
    final.pprint()

    ssc.start()
    ssc.awaitTermination()
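
For context, this PicklingError is raised while cloudpickle serializes the function that gets shipped to the workers, and it often happens when that function's closure captures a driver-side object such as an S3 connection. Below is a minimal sketch of how check could build its own client inside the function instead, assuming boto3 is available on the workers; the bucket and key names are placeholders:

    import boto3  # assumption: boto3 is installed on the workers

    def check(x):
        # Create the client inside the function so nothing from the
        # driver's scope has to be pickled along with the closure.
        s3 = boto3.client('s3')  # credentials taken from the environment or instance role
        s3.put_object(Key='key', Body='body', Bucket='bucket')
        return True

This is only a sketch of one way to keep the mapped function self-contained, not necessarily the cause of the error in my case.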