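I am trying to break some fairly complex nested JSON into a more rational format, but I am struggling to expand a key whose name changes throughout the dataset.
My dataset looks like the JSON payload shown below.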
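I have no control over the schema, so I can only work with what I am given. Essentially, the payload is split into various checks, but each check has a unique name (abc123, xzy7892 and foobar11012387 in the sample payload).
The account, accountID and checks keys can be selected directly from the dataframe.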
{
  "account": {
    "accountID": "test_account",
    "name": "abc123",
    "checks": {
      "abc123": {
        "check1": "pass",
        "check2": "fail",
        "check3": 0
      },
      "xzy7892": {
        "check1": "pass",
        "check2": "fail",
        "check3": 0,
        "result": {
          "item1": 1,
          "item2": 2
        }
      },
      "foobar11012387": {
        "check1": "fail",
        "check2": "pass",
        "check3": 0,
        "result": {
          "item1": 1,
          "item2": 2
        }
      }
    }
  }
}
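I can also go further than that (for example account.checks.abc123.check1). Ultimately, I want to break these three checks out into their own rows in the dataframe, but because the check keys change, I am not sure how to proceed.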
df2.select(['account.accountID', 'account.checks']).show()
+------------+--------------------+
| accountID| checks|
+------------+--------------------+
|test_account|[[pass, fail, 0],...|
+------------+--------------------+
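I would like the DF to look similar to the table above (I have not expanded result, but I could go further). I do not know the names of the tests in advance (for example abc123, xzy7892), and they do change, so perhaps I need to build an array first.
Any ideas?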
Answer 0 (score: 0)
If your input dataframe and schema are as follows
+-------------------------------------------------------------------------------------------+
|account |
+-------------------------------------------------------------------------------------------+
|[test_account, [[pass, fail, 0], [fail, pass, 0, [1, 2]], [pass, fail, 0, [1, 2]]], abc123]|
+-------------------------------------------------------------------------------------------+
root
|-- account: struct (nullable = true)
| |-- accountID: string (nullable = true)
| |-- checks: struct (nullable = true)
| | |-- abc123: struct (nullable = true)
| | | |-- check1: string (nullable = true)
| | | |-- check2: string (nullable = true)
| | | |-- check3: long (nullable = true)
| | |-- foobar11012387: struct (nullable = true)
| | | |-- check1: string (nullable = true)
| | | |-- check2: string (nullable = true)
| | | |-- check3: long (nullable = true)
| | | |-- result: struct (nullable = true)
| | | | |-- item1: long (nullable = true)
| | | | |-- item2: long (nullable = true)
| | |-- xzy7892: struct (nullable = true)
| | | |-- check1: string (nullable = true)
| | | |-- check2: string (nullable = true)
| | | |-- check3: long (nullable = true)
| | | |-- result: struct (nullable = true)
| | | | |-- item1: long (nullable = true)
| | | | |-- item2: long (nullable = true)
| |-- name: string (nullable = true)
you can use the struct, array and explode functions as follows to get the desired output
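from pyspark.sql import functions as f
checks1 = ['abc123', 'foobar11012387', 'xzy7892']  # check names under account.checks
checks2 = ['check1', 'check2', 'check3']  # fields present in every check
# Build one struct per field, holding that field's value (as a string) for each check name
structs = [f.struct([f.col('account.checks.'+y+'.'+x).cast('string').alias(y) for y in checks1]).alias(x)
           for x in checks2]
# Explode the array of structs into one row per field, then flatten the struct into columns
df.select(f.col('account.accountID'), f.explode(f.array(*structs)).alias('temp'))\
    .select(f.col('accountID'), f.col('temp.*'))\
    .show(truncate=False)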
which should give you
+------------+------+--------------+-------+
|accountID |abc123|foobar11012387|xzy7892|
+------------+------+--------------+-------+
|test_account|pass |fail |pass |
|test_account|fail |pass |fail |
|test_account|0 |0 |0 |
+------------+------+--------------+-------+
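The check names are hardcoded in the snippet above, while the question says they are not known in advance. As a rough illustration of what the struct/array/explode pipeline computes, here is a plain-Python sketch of the same reshaping on the sample payload, with the check names read from the data instead of being listed by hand. The function name `pivot_checks` and its `fields` parameter are made up for this example and are not part of any library.

```python
import json

# Sample payload from the question.
payload = json.loads("""
{"account": {"accountID": "test_account", "name": "abc123", "checks": {
  "abc123": {"check1": "pass", "check2": "fail", "check3": 0},
  "xzy7892": {"check1": "pass", "check2": "fail", "check3": 0,
              "result": {"item1": 1, "item2": 2}},
  "foobar11012387": {"check1": "fail", "check2": "pass", "check3": 0,
                     "result": {"item1": 1, "item2": 2}}
}}}
""")


def pivot_checks(payload, fields=("check1", "check2", "check3")):
    """Return (check_names, rows): one row per field, one column per check name.

    Hypothetical helper for illustration only.
    """
    account = payload["account"]
    checks = account["checks"]
    # Discover the check names from the data; sorted so the column order
    # matches the schema printed in the answer.
    check_names = sorted(checks)
    rows = []
    for field in fields:
        row = {"accountID": account["accountID"]}
        for name in check_names:
            # Cast to str, mirroring .cast('string') in the Spark version.
            row[name] = str(checks[name][field])
        rows.append(row)
    return check_names, rows


check_names, rows = pivot_checks(payload)
print(check_names)
for row in rows:
    print(row)
```

In PySpark itself, the equivalent list of names can probably be obtained with something like `df.select('account.checks.*').columns` and passed in place of `checks1`; treat that as an untested suggestion against your own data.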
I hope the answer is helpful.
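The question also asks for each check to end up in its own row, which is the transpose of the table produced here. A minimal plain-Python sketch of that shape follows, assuming the payload has already been parsed into a dict. The helper `checks_to_rows`, the `check_name` column and the `result_item1` style of prefixed column names are choices made for this illustration, not something taken from the original post.

```python
def checks_to_rows(account):
    """Flatten account['checks'] into one dict per check (hypothetical helper)."""
    rows = []
    for name, check in sorted(account["checks"].items()):
        # The changing key becomes a value in a fixed 'check_name' column.
        row = {"accountID": account["accountID"], "check_name": name}
        for key, value in check.items():
            if isinstance(value, dict):
                # Expand a nested struct such as `result` into prefixed columns.
                for sub_key, sub_value in value.items():
                    row[f"{key}_{sub_key}"] = sub_value
            else:
                row[key] = value
        rows.append(row)
    return rows


# Reduced version of the sample payload from the question.
account = {
    "accountID": "test_account",
    "checks": {
        "abc123": {"check1": "pass", "check2": "fail", "check3": 0},
        "xzy7892": {"check1": "pass", "check2": "fail", "check3": 0,
                    "result": {"item1": 1, "item2": 2}},
    },
}

rows = checks_to_rows(account)
for row in rows:
    print(row)
```

Moving the variable key into a column keeps the schema fixed no matter how many checks arrive or what they are called; checks without a `result` simply lack the prefixed columns.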