I have been experimenting with mongo-hadoop to read some MongoDB dumps, but so far I have not managed to read them (for one thing, I am not even sure the dumps inside the path are being picked up). As stated in the title, whenever I try to collect the contents of the RDD it throws java.lang.NegativeArraySizeException. The dump files are gzipped (uncompressed copies are also present in the same directory).
import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.BSONFileInputFormat

// Point the job at the directory holding the dumps (both .gz and uncompressed copies)
val mongoConfig = new Configuration()
mongoConfig.set("mapred.input.dir", "/Users/example/dir_with_dumps")
mongoConfig.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat")

val documents = sparkContext.newAPIHadoopRDD(
  mongoConfig,
  classOf[BSONFileInputFormat],
  classOf[Object],
  classOf[BSONObject]
)

documents.collect()
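Side note on my first doubt (whether the dumps in that path are picked up at all): a quick sanity check I can run with the plain Hadoop FileSystem API is to list what the input directory actually contains. This is just a minimal sketch, reusing the same mongoConfig and path from above:

import org.apache.hadoop.fs.{FileSystem, Path}

// List the files Hadoop sees under the input directory,
// to confirm the .bson / .gz dumps are really there
val fs = FileSystem.get(mongoConfig)
fs.listStatus(new Path("/Users/example/dir_with_dumps"))
  .foreach(status => println(s"${status.getPath} (${status.getLen} bytes)"))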
And the stack trace it produces:
Exception in thread "main" java.lang.NegativeArraySizeException
at java.util.Arrays.copyOf(Arrays.java:3236)
at org.bson.LazyBSONDecoder.decode(LazyBSONDecoder.java:59)
at com.mongodb.hadoop.splitter.BSONSplitter.splitFile(BSONSplitter.java:240)
at com.mongodb.hadoop.splitter.BSONSplitter.readSplitsForFile(BSONSplitter.java:205)
at com.mongodb.hadoop.BSONFileInputFormat.getSplits(BSONFileInputFormat.java:128)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:127)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1337)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
Any clue how to solve this? Could it be related to the format?
I have already seen the related question PySpark: Empty RDD on reading gzipped BSON files, but it does not apply here, since there is no BSONFileRDD to use from Scala yet.
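For what it's worth, my current suspicion (given that the trace fails inside BSONSplitter while computing splits) is that the splitter chokes on the gzipped copies, reading them as if they were raw BSON. A minimal sketch of what I plan to try next, assuming mapred.input.dir honors glob patterns the way FileInputFormat input paths normally do, is to restrict the input to the uncompressed .bson files only and collect again:

// Hypothetical narrowing of the input to the uncompressed dumps only,
// to test whether the gzipped files are what triggers the exception
mongoConfig.set("mapred.input.dir", "/Users/example/dir_with_dumps/*.bson")

val bsonOnly = sparkContext.newAPIHadoopRDD(
  mongoConfig,
  classOf[BSONFileInputFormat],
  classOf[Object],
  classOf[BSONObject]
)
bsonOnly.collect()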