'java.lang.NegativeArraySizeException' when reading MongoDB dumps

Date: 2019-04-18 10:55:58

Tags: scala apache-spark hadoop

I've been trying to read some MongoDB dumps with mongo-hadoop, but so far I haven't been able to read any of them (to begin with, I'm not even sure the dumps inside the path are being picked up).

As the title says, whenever I try to collect the contents of the RDD it throws a java.lang.NegativeArraySizeException. The dump files are in .gzip format (unzipped copies of them are also present in the directory).

    import org.apache.hadoop.conf.Configuration
    import com.mongodb.hadoop.BSONFileInputFormat
    import org.bson.BSONObject

    // Point mongo-hadoop at the directory containing the dumps.
    val mongoConfig = new Configuration()
    mongoConfig.set("mapred.input.dir", "/Users/example/dir_with_dumps")
    mongoConfig.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat")

    // Read the BSON files as an RDD of (Object, BSONObject) pairs.
    val documents = sparkContext.newAPIHadoopRDD(
      mongoConfig,
      classOf[BSONFileInputFormat],
      classOf[Object],
      classOf[BSONObject]
    )

    documents.collect()

And the stack trace it produces...

    Exception in thread "main" java.lang.NegativeArraySizeException
        at java.util.Arrays.copyOf(Arrays.java:3236)
        at org.bson.LazyBSONDecoder.decode(LazyBSONDecoder.java:59)
        at com.mongodb.hadoop.splitter.BSONSplitter.splitFile(BSONSplitter.java:240)
        at com.mongodb.hadoop.splitter.BSONSplitter.readSplitsForFile(BSONSplitter.java:205)
        at com.mongodb.hadoop.BSONFileInputFormat.getSplits(BSONFileInputFormat.java:128)
        at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:127)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
        at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1337)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.take(RDD.scala:1331)

Any clue how to fix this? Could it be related to the format?
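To rule out the path problem first, here is a minimal sketch of checking what Hadoop actually sees under the input directory (assuming the path resolves to the default local FileSystem):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    // List every file Hadoop sees under the configured input directory,
    // to confirm whether the .bson / .gz dumps are picked up at all.
    val inputDir = new Path("/Users/example/dir_with_dumps")
    val fs = inputDir.getFileSystem(new Configuration())
    fs.listStatus(inputDir).foreach(status => println(status.getPath))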

I've already seen the related question PySpark: Empty RDD on reading gzipped BSON files, but it doesn't apply here, since there's no such issue yet with the BSONFileRDD used from Scala.
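That said, if the .gz compression turns out to be the culprit, one variation I could try (just a sketch, assuming FileInputFormat expands glob patterns in mapred.input.dir, which I believe it does) is to point the job only at the uncompressed .bson dumps:

    import org.apache.hadoop.conf.Configuration

    // Hypothetical variation: restrict the input to the plain .bson files,
    // in case BSONSplitter is choking on the gzipped dumps.
    val mongoConfig = new Configuration()
    mongoConfig.set("mapred.input.dir", "/Users/example/dir_with_dumps/*.bson")
    mongoConfig.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat")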

0 Answers:

No answers yet