Question

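I am trying to read data into Spark using the mongo-hadoop connector. The problem is that when I set a limit on the data being read, the resulting RDD contains limit * number of partitions elements.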

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaPairRDD;
import org.bson.BSONObject;
import com.mongodb.hadoop.MongoInputFormat;

// Assumes sc is an existing JavaSparkContext.
Configuration mongodbConfig = new Configuration();
mongodbConfig.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
mongodbConfig.set("mongo.input.uri", "mongodb://localhost:27017/test.restaurants");
mongodbConfig.set("mongo.input.limit", "3");

JavaPairRDD<Object, BSONObject> documents = sc.newAPIHadoopRDD(
        mongodbConfig,            // Configuration
        MongoInputFormat.class,   // InputFormat: read from a live cluster.
        Object.class,             // Key class
        BSONObject.class          // Value class
);

long count = documents.count();
System.out.println("Collection Count: " + count);
System.out.println("Partitions: " + documents.partitions().size());

// 9 elements in the RDD = limit * number of partitions = 3 * 3
// 3 partitions

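This behavior is reproducible with other limits (I always get limit * 3).

If I query for a single document by objectId, I get similar behavior: the RDD contains the same object repeated once per partition (in my case, 3 elements holding the same document).

I can also provide the script used to create the mongo collection, if that would be useful.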

Answer 1

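This is a feature, not a bug. mongo.input.limit is used to set the limit parameter for each MongoInputSplit, so it is applied per partition rather than globally.

In general, it is not possible (or, more precisely, not practical) to limit the number of fetched records globally. Each split is processed independently, and there is usually no a priori knowledge of how many records each split will yield.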
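
To make the arithmetic concrete, here is a minimal sketch in plain Java that simulates the behavior described above. It does not use the mongo-hadoop or Spark APIs; the class and method names (SplitLimitDemo, readWithPerSplitLimit, capGlobally) are hypothetical and exist only for illustration. It assumes three splits of ten records each, mirroring the three partitions in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation; not part of mongo-hadoop or Spark.
final class SplitLimitDemo {

    // Each split is read independently and stops after `limit` records,
    // which is how a per-split limit behaves.
    static List<Integer> readWithPerSplitLimit(List<List<Integer>> splits, int limit) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> split : splits) {
            out.addAll(split.subList(0, Math.min(limit, split.size())));
        }
        return out;
    }

    // A global cap can only be applied after the per-split results are combined.
    static List<Integer> capGlobally(List<Integer> records, int limit) {
        return new ArrayList<>(records.subList(0, Math.min(limit, records.size())));
    }

    public static void main(String[] args) {
        // Build 3 splits of 10 records each (assumed sizes for the demo).
        List<List<Integer>> splits = new ArrayList<>();
        int id = 0;
        for (int s = 0; s < 3; s++) {
            List<Integer> split = new ArrayList<>();
            for (int i = 0; i < 10; i++) {
                split.add(id++);
            }
            splits.add(split);
        }

        List<Integer> perSplit = readWithPerSplitLimit(splits, 3);
        System.out.println("Per-split limit: " + perSplit.size());                 // prints 9
        System.out.println("After global cap: " + capGlobally(perSplit, 3).size()); // prints 3
    }
}
```

In Spark itself, if only a handful of records is needed on the driver, one option would be to call take(n) on the loaded RDD after reading. That is a suggestion on my part, not something the connector does for you.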

MongoHadoop Connector used with Spark duplicates results by the number of partitions
