Question

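I have relatively large documents in MongoDB, and I only need to load a small part of each document into a Spark Dataframe to work with. Here is an example of a document (I have removed many unnecessary fields to keep this question readable):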

root
     |-- _id: struct (nullable = true)
     |    |-- oid: string (nullable = true)
     |-- customerInfo: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- events: array (nullable = true)
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- relevantField: integer (nullable = true)
     |    |    |    |    |-- relevantField_2: string (nullable = true)
     |    |    |-- situation: array (nullable = true)
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- currentRank: integer (nullable = true)
     |    |    |-- info: struct (nullable = true)
     |    |    |    |-- customerId: integer (nullable = true)

What I do now is explode "customerInfo":

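    import com.mongodb.spark.MongoSpark
    import org.apache.spark.sql.functions.{col, explode}

    // sparksess is an existing SparkSession configured with the MongoDB input URI
    val df = MongoSpark.load(sparksess)
    // one row per customerInfo element, then pick out the nested fields
    val new_df = df.withColumn("customerInfo", explode(col("customerInfo")))
      .select(
        col("_id"),
        col("customerInfo.situation").getItem(13).getField("currentRank").alias("currentRank"),
        col("customerInfo.info.customerId"),
        col("customerInfo.events.relevantField"),
        col("customerInfo.events.relevantField_2"))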

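Now, as I understand it, this loads the entire "customerInfo" into memory in order to operate on it, which is a waste of time and resources. How can I explode only the specific information I need? Thanks!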

Answer 1

How can I explode only the specific information I need?

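Use filters to filter the data in MongoDB before it is sent to Spark. The MongoDB Spark Connector constructs an aggregation pipeline so that only the filtered data is sent to Spark, which reduces the amount of data transferred.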

You can use the $project aggregation stage to project only certain fields. See also MongoDB Spark Connector: Filters and Aggregation.
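
As a rough sketch of the $project approach applied to the schema in the question: the code below assumes the 2.x line of the MongoDB Spark Connector (the one that provides MongoSpark), and the connection URI, database and collection names are placeholders I made up, not values from the question. The pipeline is passed through the connector's builder so that the projection runs in MongoDB before any data reaches Spark.

```scala
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}
import org.bson.Document

object ProjectedLoad {
  def main(args: Array[String]): Unit = {
    // Assumption: the URI, database and collection below are hypothetical placeholders.
    val sparksess = SparkSession.builder()
      .appName("projected-load")
      .config("spark.mongodb.input.uri", "mongodb://localhost:27017/mydb.customers")
      .getOrCreate()

    // $project stage that keeps only the nested fields used in the question.
    // _id is included by default in an inclusion projection.
    val projection = Document.parse(
      """{ "$project": {
        |  "customerInfo.situation.currentRank": 1,
        |  "customerInfo.info.customerId": 1,
        |  "customerInfo.events.relevantField": 1,
        |  "customerInfo.events.relevantField_2": 1
        |} }""".stripMargin)

    // The pipeline is executed by MongoDB, so only the projected fields are shipped to Spark.
    val df = MongoSpark.builder()
      .sparkSession(sparksess)
      .pipeline(Seq(projection))
      .build()
      .toDF()

    // Same explode and select as in the question, now over much smaller documents.
    val new_df = df.withColumn("customerInfo", explode(col("customerInfo")))
      .select(
        col("_id"),
        col("customerInfo.situation").getItem(13).getField("currentRank").alias("currentRank"),
        col("customerInfo.info.customerId"),
        col("customerInfo.events.relevantField"),
        col("customerInfo.events.relevantField_2"))

    new_df.printSchema()
    sparksess.stop()
  }
}
```

Note that this projection still returns currentRank for every element of the situation array, not just the element at index 13; the getItem(13) step is left to Spark here. A $match stage could be placed before $project in the same Seq if whole documents should be filtered out as well.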
