Question

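I use the output of Maxwell's Daemon to capture changes made to a MySQL database. It represents each change as nested JSON fields: 'data' contains the latest snapshot of the row, and 'old' contains the previous values of the fields that changed.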

When I read this JSON into a Spark DataFrame, every field that is missing from 'old' is set to 'null'.

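This is unfortunate, because I cannot tell whether a field changed from 'null' to '[some_value]', or whether some other field of the row changed and the 'null' merely stands for a field that is absent from the JSON.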

Here is an example:

from pyspark.sql.types import StructType, StructField, StringType, BooleanType, LongType

custom_schema = StructType(
[StructField("type", StringType(), True),
 StructField("ts", LongType(), True),
 StructField("xid", LongType(), True),
 StructField("data", StructType([
     StructField("id", LongType(), True),
     StructField("bought_by", StringType(), True),
     StructField("userprofile_id", StringType(), True)]), True),
 StructField("old", StructType([
     StructField("id", LongType(), True),
     StructField("bought_by", StringType(), True),
     StructField("userprofile_id", StringType(), True)]), True)]
)

source_list = [
'{"type":"update","ts":1510901244,"xid":1,"data":{"id":1,"bought_by":"user:1","userprofile_id":1}, "old":{"userprofile_id":null}}', 
'{"type":"update","ts":1510901245,"xid":2,"data":{"id":1,"bought_by":"user:1","userprofile_id":null}, "old":{"userprofile_id":2}}',
'{"type":"update","ts":1510901246,"xid":3,"data":{"id":1,"bought_by":"user:1","userprofile_id":1}, "old":{"userprofile_id":2}}',
'{"type":"update","ts":1510901246,"xid":4,"data":{"id":1,"bought_by":"user:1","userprofile_id":1}, "old":{"bought_by":"user:2"}}',
]

df = spark.read.json(spark.sparkContext.parallelize(source_list), schema=custom_schema)

df.show()

The output is:

+------+----------+---+---------------+------------------+
|  type|        ts|xid|           data|               old|
+------+----------+---+---------------+------------------+
|update|1510901244|  1|   [1,user:1,1]|  [null,null,null]|
|update|1510901245|  2|[1,user:1,null]|     [null,null,2]|
|update|1510901246|  3|   [1,user:1,1]|     [null,null,2]|
|update|1510901246|  4|   [1,user:1,1]|[null,user:2,null]|
+------+----------+---+---------------+------------------+

But I would like to produce something like this:

+------+----------+---+---------------+--------------------+
|  type|        ts|xid|           data|                 old|
+------+----------+---+---------------+--------------------+
|update|1510901244|  1|   [1,user:1,1]|  ['N/A','N/A',null]|
|update|1510901245|  2|[1,user:1,null]|     ['N/A','N/A',2]|
|update|1510901246|  3|   [1,user:1,1]|     ['N/A','N/A',2]|
|update|1510901246|  4|   [1,user:1,1]|['N/A',user:2,'N/A']|
+------+----------+---+---------------+--------------------+

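I spent a long time looking for a solution, but I only found articles explaining that 'null' values represent missing fields, and solutions that replace all 'null' values with some other value. Neither helps in my case.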

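The closest I have come to a solution so far:

Since we use Gobblin to extract the data, we could add a rule that replaces

"userprofile_id":null with "userprofile_id":-1

or, for string values, replaces

"string_field":null with "string_field":"N/A"

But this does not scale well, since it needs a separate rule for every field.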

Any help with this problem is greatly appreciated. Thanks!

Answer 1

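I ended up reading the source files as an RDD, doing a string replacement of 'null' for the fields I want to recognize (using a default value), writing the result out to a temporary location, and reading the contents back as a DataFrame. Later in my code I treat the default value as null. This is very ugly, but it works.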

NULL_USERPROFILE_ID = -1234321

in_file = spark.sparkContext.textFile(source + "*")
rdd = in_file.map(lambda x: x.replace('"userprofile_id":null', '"userprofile_id":%d' % NULL_USERPROFILE_ID))
rdd.saveAsTextFile(destination)

I would be happy to refactor this if I get a clue about how to assign default values to missing JSON fields when reading them into a DataFrame.
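
One possible way to avoid a per-field sentinel is to record which keys were actually present in 'old' before Spark flattens missing keys to null. The sketch below is not part of the original answer: the helper name `tag_changed_fields` and the extra `old_keys` field are hypothetical, and it assumes each input line is one complete Maxwell JSON record.

```python
import json


def tag_changed_fields(line):
    """Return the record with an extra 'old_keys' list (hypothetical field)
    naming the keys that literally appear in 'old'."""
    record = json.loads(line)
    # 'old' is absent on some records; treat that as "nothing changed".
    old = record.get("old") or {}
    record["old_keys"] = sorted(old.keys())
    return json.dumps(record)


sample = ('{"type":"update","ts":1510901244,"xid":1,'
          '"data":{"id":1,"bought_by":"user:1","userprofile_id":1}, '
          '"old":{"userprofile_id":null}}')

tagged = json.loads(tag_changed_fields(sample))
print(tagged["old_keys"])  # ['userprofile_id']
```

With Spark, the same function could presumably be applied as `rdd.map(tag_changed_fields)` before calling `spark.read.json`, with `StructField("old_keys", ArrayType(StringType()), True)` added to the schema. A field that is listed in `old_keys` but is null in 'old' was then genuinely null before the update, while a null field that is not listed was simply absent. Because `spark.read.json` accepts an RDD of strings, this should also remove the need to write to a temporary location.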
