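I currently have a text string delimited by & that I need to parse with PySpark to extract the key-value pairs into an array/dictionary. I can do this for most of the tags in the string, but some keys carry an index, and that index can differ from record to record, although the key names themselves are always the same (if that makes sense). I think what I need to do is iterate over the string.
Example input: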
"{pr1nm=Apples&pr1id=1111111&pr1pr=200.00&pr1qt=1&pr2nm=Pears&pr2id=1111112&pr2pr=300.00&pr2qt=2}"
Desired output:
[{
"ProductName":"Apples",
"ProductId":"1111111",
"ProductPrice":"200.00",
"ProductQuantity":"1"
},
{
"ProductName":"Pears",
"ProductId":"1111112",
"ProductPrice":"300.00",
"ProductQuantity":"2"
}]
If the same string also contains other tags that are not related to products, for example:
"{dl=https://stackoverflow.com/posts/XXXXX&t=pageview&pr1nm=Apples&pr1id=1111111&pr1pr=200.00&pr1qt=1&pr2nm=Pears&pr2id=1111112&pr2pr=300.00&pr2qt=2}"
then the output should look like this, with the products in a nested array:
{"DocumentLocation":"https://stackoverflow.com/posts/XXXXX",
"HitType":"pageview",
"Products": [{
"ProductName":"Apples",
"ProductId":"1111111",
"ProductPrice":"200.00",
"ProductQuantity":"1"
},
{
"ProductName":"Pears",
"ProductId":"1111112",
"ProductPrice":"300.00",
"ProductQuantity":"2"
}]
}
Answer 0 (score: 1)
You can use str_to_map to convert the string into a map column, as follows:
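from pyspark.sql.functions import expr

# Strip the surrounding braces, then split into pairs on '&' and into key/value on '='.
df = df.withColumn("input", expr("ltrim('{', rtrim('}', input))"))\
    .withColumn("input", expr("str_to_map(input, '&', '=')"))
df.show(truncate=False)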
+-------------------------------------------------------------------------------------------------------------------------------+
|input |
+-------------------------------------------------------------------------------------------------------------------------------+
|[pr1nm -> Apples, pr1id -> 1111111, pr1pr -> 200.00, pr1qt -> 1, pr2nm -> Pears, pr2id -> 1111112, pr2pr -> 300.00, pr2qt -> 2]|
+-------------------------------------------------------------------------------------------------------------------------------+
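For checking the splitting logic without a Spark session, the same two steps can be approximated in plain Python. This is only a sketch: the function name parse_pairs is hypothetical, and it assumes (as the Spark expression above does) that no value contains a literal & character.

```python
def parse_pairs(raw, pair_sep="&", kv_sep="="):
    """Plain-Python approximation of trimming the braces and calling str_to_map."""
    # Remove the surrounding braces, like the ltrim/rtrim step.
    body = raw.strip().lstrip("{").rstrip("}")
    result = {}
    for pair in body.split(pair_sep):
        if not pair:
            continue
        # Split on the first '=' only, so a value that itself contains '=' stays intact.
        key, sep, value = pair.partition(kv_sep)
        # A pair without '=' gets no value.
        result[key] = value if sep else None
    return result


flat = parse_pairs("{pr1nm=Apples&pr1id=1111111&pr1pr=200.00&pr1qt=1}")
print(flat)
```

Note that the keys are still the raw tags (pr1nm, pr1id, and so on); renaming and grouping them is a separate step.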
Then, if you need a JSON string, use the to_json function:
from pyspark.sql.functions import col, to_json

df.withColumn("input", to_json(col("input"))) \
    .show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------+
|input |
+--------------------------------------------------------------------------------------------------------------------------------+
|{"pr1nm":"Apples","pr1id":"1111111","pr1pr":"200.00","pr1qt":"1","pr2nm":"Pears","pr2id":"1111112","pr2pr":"300.00","pr2qt":"2"}|
+--------------------------------------------------------------------------------------------------------------------------------+
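The JSON above still uses the raw tags and keeps every product field at the top level, whereas the question asks for renamed keys and a nested Products array. One possible way to get there, sketched in plain Python, is to group the indexed keys by their number after the map has been built. The function name nest_products and the regular expression are assumptions made for illustration; the tag-to-name mappings are the ones shown in the question's desired output.

```python
import re

# Suffix and tag names taken from the question's desired output.
PRODUCT_FIELDS = {
    "nm": "ProductName",
    "id": "ProductId",
    "pr": "ProductPrice",
    "qt": "ProductQuantity",
}
HIT_FIELDS = {"dl": "DocumentLocation", "t": "HitType"}

# Assumed key shape: 'pr', a numeric index, then a lowercase field suffix.
PRODUCT_KEY = re.compile(r"^pr(\d+)([a-z]+)$")


def nest_products(flat):
    """Group indexed product keys into a list and rename the remaining tags."""
    hit = {}
    products = {}
    for key, value in flat.items():
        match = PRODUCT_KEY.match(key)
        if match and match.group(2) in PRODUCT_FIELDS:
            index = int(match.group(1))
            field = PRODUCT_FIELDS[match.group(2)]
            products.setdefault(index, {})[field] = value
        else:
            # Unknown tags are kept under their original name.
            hit[HIT_FIELDS.get(key, key)] = value
    # Order the products by their numeric index.
    hit["Products"] = [products[i] for i in sorted(products)]
    return hit


example = {
    "dl": "https://stackoverflow.com/posts/XXXXX",
    "t": "pageview",
    "pr1nm": "Apples", "pr1id": "1111111", "pr1pr": "200.00", "pr1qt": "1",
    "pr2nm": "Pears", "pr2id": "1111112", "pr2pr": "300.00", "pr2qt": "2",
}
nested = nest_products(example)
print(nested["Products"])
```

Inside Spark, a function like this could be applied to the map column through a Python UDF, at the usual cost of moving rows between the JVM and Python; whether that trade-off is acceptable depends on the data volume.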