Question

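How can I avoid conflicting data types when mapping a MongoDB collection to a Spark DataFrame? We are unable to cast the conflict data type to string, and we get an error when performing a select operation.

I am using MongoDB Spark Connector v2.10:1.0.0.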

Answer 1

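ConflictType means the field was found to contain different data types that cannot be coerced into a single unifying type. In other words, the field holds mixed data, for example numbers and strings. Run printSchema() on your DataFrame to check which fields have the ConflictType.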
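If the schema is too large to scan by eye, the same check can be done in code. The following is only a sketch: `conflictFields` is a hypothetical helper, not part of the connector, and it assumes that the conflict type reports the type name "conflict" (the same name that shows up in the schema JSON used in Answer 2).

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, StructType}

// Hypothetical helper: collects the dotted paths of all fields whose type
// name is "conflict", descending into nested structs and array elements.
def conflictFields(dataType: DataType, prefix: String = ""): Seq[String] =
  dataType match {
    case struct: StructType =>
      struct.fields.toSeq.flatMap { field =>
        val path = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
        conflictFields(field.dataType, path)
      }
    case array: ArrayType =>
      conflictFields(array.elementType, prefix)
    // Assumption: the connector's conflict type has the type name "conflict".
    case other if other.typeName == "conflict" =>
      Seq(prefix)
    case _ =>
      Seq.empty
  }
```

Assuming `df` is a DataFrame loaded through the connector, `conflictFields(df.schema).foreach(println)` would list the affected fields.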

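In MongoDB Spark Connector v1, the workaround is to set the DataFrame's schema manually so that the conflicting fields are declared as string.

In MongoDB Spark Connector v2, the base type for conflicting types is string. See also SPARK-84.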

Answer 2

For those who are using the v1 connector and need to apply the manual workaround, here is some code I wrote to solve this problem.

 import com.mongodb.spark.MongoSpark
 import org.apache.spark.sql.types.{DataType, StructType}

 // sql (presumably an SQLContext), uri, db and COLLECTION are defined elsewhere.

 // first pass: load with sampling so that the connector infers a schema,
 // which may contain conflict types
 val inferred = MongoSpark
   .read(sql)
   .option("samplingRatio", "0.3")
   .option("spark.mongodb.input.uri", uri)
   .option("spark.mongodb.input.database", db)
   .option("spark.mongodb.input.collection", COLLECTION)
   .option("spark.mongodb.input.readPreference.name", "secondary")
   .load()

 // replace all instances of conflict with type string in the schema's json
 val schemaJson = inferred.schema.json.replace("\"type\":\"conflict\"", "\"type\":\"string\"")

 // convert it back into a StructType
 val newSchema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

 // second pass: load again with the conflict types replaced by string
 val table = MongoSpark
   .read(sql)
   .schema(newSchema)
   .option("spark.mongodb.input.uri", uri)
   .option("spark.mongodb.input.database", db)
   .option("spark.mongodb.input.collection", COLLECTION)
   .option("spark.mongodb.input.readPreference.name", "secondary")
   .load()
   .repartition(6)
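
To see what the string replacement does in isolation, here is a small sketch that needs no Spark at all. The sample JSON is a hypothetical schema written in the shape that a StructType's json output takes; the field names "age" and "name" are made up for illustration.

```scala
// Hypothetical schema JSON; only the "age" field has the conflict type.
val sampleJson =
  """{"type":"struct","fields":[""" +
  """{"name":"age","type":"conflict","nullable":true,"metadata":{}},""" +
  """{"name":"name","type":"string","nullable":true,"metadata":{}}]}"""

// Same textual replacement as in the code above:
// every "type":"conflict" entry becomes "type":"string".
val fixedJson = sampleJson.replace("\"type\":\"conflict\"", "\"type\":\"string\"")

println(fixedJson)
```

Because the match is purely textual, it also reaches conflict fields inside nested structs. It would not touch an array whose element type is conflict, since Spark writes that under the "elementType" key rather than "type"; if the inferred schema contains such an array, that key would need its own replacement.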
