Question

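I am trying to fetch documents from a MongoDB collection using a PySpark script on Databricks. I am fetching the data one day at a time. The script runs fine for some days, but for certain days it throws the following error.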

com.mongodb.MongoInternalException: The reply message length 14484499 is less than the maximum message length 4194304.

I am not sure what this error means or how to resolve it. Any help is appreciated.

This is the sample code I am running:

pipeline = [{'$match':{'$and':[{'UpdatedTimestamp':{'$gte': 1555891200000}},
                               {'UpdatedTimestamp':{'$lt': 1555977600000}}]}}]

# Wrap the chained calls in parentheses so they can span several lines.
READ_MSG = (spark.read.format("com.mongodb.spark.sql.DefaultSource")
                 .option("uri", connectionstring)
                 .option("pipeline", pipeline)
                 .load())

The date-time values are given in epoch format (milliseconds).
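For reference, the two bounds in the pipeline above correspond to midnight UTC on 2019-04-22 and 2019-04-23. Below is a minimal sketch of how such bounds could be computed for any day, assuming the stored timestamps are UTC epoch milliseconds. The helper names `day_bounds_ms` and `daily_pipeline` are hypothetical and not part of the original script.

```python
from datetime import datetime, timedelta, timezone


def day_bounds_ms(year, month, day):
    """Return (start, end) of the given UTC day as epoch milliseconds."""
    start = datetime(year, month, day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    # timestamp() is in seconds; midnight is a whole second, so int() is exact.
    return int(start.timestamp()) * 1000, int(end.timestamp()) * 1000


def daily_pipeline(year, month, day, field="UpdatedTimestamp"):
    """Build a $match stage selecting one UTC day (start inclusive, end exclusive)."""
    gte, lt = day_bounds_ms(year, month, day)
    # A single range condition is equivalent to the $and form used in the question.
    return [{"$match": {field: {"$gte": gte, "$lt": lt}}}]


pipeline = daily_pipeline(2019, 4, 22)
```

The resulting `pipeline` can be passed to `.option("pipeline", ...)` in the same way as the hand-written one.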

Answer 1

This is really a comment rather than an answer (I do not have enough reputation to comment).

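I have the same problem. After some research, I found that the problem is caused by my nested field "survey", which has more than one sublevel: I was able to read the database by selecting all fields except this one: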

root
 |-- _id: string (nullable = true)
 |-- _t: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- address: struct (nullable = true)
 |    |-- streetAddress1: string (nullable = true)
 |-- survey: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- SurveyQuestionId: string (nullable = true)
 |    |    |-- updated: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |    |-- value: string (nullable = true)
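The answer does not say how the field was left out. One possible way, shown here only as a sketch, is to drop it on the MongoDB side with a `$project` exclusion stage in the aggregation pipeline. The helper name `exclude_fields_stage` is hypothetical, the `$match` stage is borrowed from the question, and whether this avoids the error in a given setup is an assumption that would need to be verified.

```python
import json


def exclude_fields_stage(*field_names):
    """Build a $project stage that drops the given top-level fields."""
    return {"$project": {name: 0 for name in field_names}}


pipeline = [
    # Date filter taken from the question; adjust or remove as needed.
    {"$match": {"UpdatedTimestamp": {"$gte": 1555891200000,
                                     "$lt": 1555977600000}}},
    # Drop the deeply nested field before the documents reach Spark.
    exclude_fields_stage("survey"),
]

# Serialize to JSON so the pipeline can be passed as a string option,
# e.g. .option("pipeline", pipeline_json) on the reader.
pipeline_json = json.dumps(pipeline)
```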

Has anyone found a workaround for what appears to be a bug in the MongoDB Spark connector?

Answer 2

The problem seems to be resolved after adding appName to the MongoDB connection string. I no longer get this error.
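The answer does not explain why this helps. As a sketch of the change itself: `appName` is a standard MongoDB connection string option, and it can be appended to an existing URI as shown below. The helper `with_app_name`, the example URI, and the application name are all hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_app_name(uri, app_name):
    """Return the MongoDB URI with an appName option set in its query string."""
    parts = urlsplit(uri)
    # Keep the existing options, dropping any previous appName value.
    params = [(key, value)
              for key, value in parse_qsl(parts.query, keep_blank_values=True)
              if key.lower() != "appname"]
    params.append(("appName", app_name))
    # Keep a "/" between the host list and the options if no database is given.
    path = parts.path or "/"
    return urlunsplit(parts._replace(path=path, query=urlencode(params)))


# Hypothetical URI, for illustration only.
connectionstring = with_app_name(
    "mongodb://user:secret@example-host:27017/mydb.mycoll?ssl=true",
    "daily-export",
)
```

The resulting string can then be passed to `.option("uri", connectionstring)` as in the question.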
