pymongo-spark: BsonSerializationException while decoding a BSON string

Date: 2016-06-26 06:52:00

Tags: mongodb apache-spark pyspark pymongo bson

I am running some PySpark code through an IPython notebook that loads and processes three Mongo collections as RDDs, merges them (using unionAll and dropDuplicates), converts the merged result into a DataFrame, and writes it out as CSV.

The Spark job fails, apparently because pymongo-spark cannot load a few of the documents. How can I ignore the bad documents, or add a try/except block to skip this exception, and what does the exception actually mean? Loading the RDD with sc.mongoRDD(database_uri) does not give me anywhere to insert error-handling logic.
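To make concrete what I mean by "adding a try/except block": the sketch below reuses my existing helper names, so treat it as a hypothetical simplification. It wraps the per-document Python processing so that a document that fails inside my own code is mapped to None and dropped instead of failing the task:

def safe_process(doc):
    # Stand-in for my real pipeline: select fields, add a field, build a tuple.
    # Any document that raises inside my Python code is turned into None and
    # filtered out afterwards.
    try:
        return doc_to_tuple(add_a_field(select_certain_fields1(doc)))
    except Exception:
        return None

rdd1 = sc.mongoRDD(database_uri1) \
         .map(safe_process) \
         .filter(lambda t: t is not None and len(t))

The problem is that, judging by the stack trace below, the exception is raised inside com.mongodb.hadoop.input.MongoRecordReader.nextKeyValue on the JVM side, before the document ever reaches my Python functions, so a wrapper like this never gets a chance to catch it.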

I am hitting this exception on some tasks:

org.bson.BsonSerializationException: While decoding a BSON string found a size that is not a positive number: 0
at org.bson.io.ByteBufferBsonInput.readString(ByteBufferBsonInput.java:107)
at org.bson.BsonBinaryReader.doReadString(BsonBinaryReader.java:223)
at org.bson.AbstractBsonReader.readString(AbstractBsonReader.java:430)
at org.bson.codecs.StringCodec.decode(StringCodec.java:39)
at org.bson.codecs.StringCodec.decode(StringCodec.java:28)
at com.mongodb.DBObjectCodec.readValue(DBObjectCodec.java:306)
at com.mongodb.DBObjectCodec.readDocument(DBObjectCodec.java:345)
at com.mongodb.DBObjectCodec.readValue(DBObjectCodec.java:286)
at com.mongodb.DBObjectCodec.readArray(DBObjectCodec.java:333)
at com.mongodb.DBObjectCodec.readValue(DBObjectCodec.java:289)
at com.mongodb.DBObjectCodec.readDocument(DBObjectCodec.java:345)
at com.mongodb.DBObjectCodec.readValue(DBObjectCodec.java:286)
at com.mongodb.DBObjectCodec.readDocument(DBObjectCodec.java:345)
at com.mongodb.DBObjectCodec.decode(DBObjectCodec.java:136)
at com.mongodb.DBObjectCodec.decode(DBObjectCodec.java:61)
at com.mongodb.CompoundDBObjectCodec.decode(CompoundDBObjectCodec.java:43)
at com.mongodb.CompoundDBObjectCodec.decode(CompoundDBObjectCodec.java:27)
at com.mongodb.connection.ReplyMessage.<init>(ReplyMessage.java:57)
at com.mongodb.connection.QueryProtocol.execute(QueryProtocol.java:305)
at com.mongodb.connection.QueryProtocol.execute(QueryProtocol.java:54)
at com.mongodb.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:159)
at com.mongodb.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:286)
at com.mongodb.connection.DefaultServerConnection.query(DefaultServerConnection.java:209)
at com.mongodb.operation.FindOperation$1.call(FindOperation.java:493)
at com.mongodb.operation.FindOperation$1.call(FindOperation.java:480)
at com.mongodb.operation.OperationHelper.withConnectionSource(OperationHelper.java:239)
at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:212)
at com.mongodb.operation.FindOperation.execute(FindOperation.java:480)
at com.mongodb.operation.FindOperation.execute(FindOperation.java:77)
at com.mongodb.Mongo.execute(Mongo.java:772)
at com.mongodb.Mongo$2.execute(Mongo.java:759)
at com.mongodb.DBCursor.initializeCursor(DBCursor.java:851)
at com.mongodb.DBCursor.hasNext(DBCursor.java:152)
at com.mongodb.hadoop.input.MongoRecordReader.nextKeyValue(MongoRecordReader.java:78)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:116)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:111)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:420)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:249)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)

For reference, I am running Spark 1.5.1 on four workers with the following submit arguments:

export PYSPARK_SUBMIT_ARGS="--master spark://<IP_ADDRESS>:7077 --executor-memory 18g --driver-memory 4g --num-executors 4 --executor-cores 6 --conf spark.cores.max=24 --conf spark.driver.maxResultSize=4g --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/home/tao/eventLogging --jars /usr/local/spark/lib/mongo-hadoop-spark-1.5.0.jar --driver-class-path /usr/local/spark/lib/mongo-hadoop-spark-1.5.0.jar --packages com.stratio.datasource:spark-mongodb_2.10:0.11.0 --packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"

I launched an IPython notebook with ipython notebook --profile=pyspark and ran the following (simplified) code:

hdfs_path = 'hdfs://<IP address>/path/to/training_data_folder'

rdd1 = sc.mongoRDD(database_uri1)\
         .map(select_certain_fields1)\
         .filter(lambda doc: len(doc.keys()))\
         .map(add_a_field)\
         .map(doc_to_tuple)\
         .filter(len)

rdd2 = sc.mongoRDD(database_uri2)\
         .map(select_certain_fields2)\
         .filter(lambda doc: len(doc.keys()))\
         .map(add_a_field)\
         .map(doc_to_tuple)\
         .filter(len)

rdd3 = sc.mongoRDD(database_uri3)\
         .map(select_certain_fields3)\
         .filter(lambda doc: len(doc.keys()))\
         .map(add_a_field)\
         .map(doc_to_tuple)\
         .filter(len)

df1 = sqlContext.createDataFrame(rdd1, schema)
df2 = sqlContext.createDataFrame(rdd2, schema)
df3 = sqlContext.createDataFrame(rdd3, schema)

df_all = df2.unionAll(df3).unionAll(df1)\
                   .dropDuplicates(['caseId', 'timestamp'])

df_all.write.format('com.databricks.spark.csv') \
    .save(hdfs_path + '/all_notes_1.csv')
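For reference, the helpers and schema used above are roughly shaped like this. Apart from caseId and timestamp (which the dropDuplicates call relies on), the field names are placeholders for the real ones:

from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema: only caseId and timestamp are the real column names.
schema = StructType([
    StructField('caseId', StringType(), True),
    StructField('timestamp', StringType(), True),
    StructField('text', StringType(), True),
    StructField('source', StringType(), True),
])

def select_certain_fields1(doc):
    # Keep only the fields of interest; return {} so the following filter
    # drops documents that are missing any of them.
    wanted = ('caseId', 'timestamp', 'text')
    if not all(k in doc for k in wanted):
        return {}
    return {k: doc[k] for k in wanted}

def add_a_field(doc):
    # Derive one extra field from the existing ones.
    if doc:
        doc['source'] = 'collection1'
    return doc

def doc_to_tuple(doc):
    # Flatten into a tuple in the same order as the schema; an empty tuple
    # is removed by the trailing .filter(len).
    if not doc:
        return ()
    return tuple(doc.get(f.name) for f in schema.fields)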

Thanks!

0 Answers:

There are no answers to this question yet.