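Spark: splitting nested JSON into rows

Time: 2018-07-01 11:43:47

Tags: json apache-spark pyspark

I am trying to split some fairly complex nested JSON into a more sensible format, but I am struggling to expand a key whose name changes throughout the dataset.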

My dataset looks like this:

{
    "account": {
        "accountID": "test_account",
        "name": "abc123",
        "checks": {
            "abc123": {
                "check1": "pass",
                "check2": "fail",
                "check3": 0
            },
            "xzy7892": {
                "check1": "pass",
                "check2": "fail",
                "check3": 0,
                "result": {
                    "item1": 1,
                    "item2": 2
                }
            },
            "foobar11012387": {
                "check1": "fail",
                "check2": "pass",
                "check3": 0,
                "result": {
                    "item1": 1,
                    "item2": 2
                }
            }
        }
    }
}

I have no control over the schema, so I have to work with what I am given. Essentially, the payload is split into various checks, but each check has a unique name (abc123, xzy7892 and foobar11012387 in the sample payload).

The account, accountID and checks keys can be selected directly from the DataFrame:

df2.select(['account.accountID', 'account.checks']).show()
+------------+--------------------+
|   accountID|              checks|
+------------+--------------------+
|test_account|[[pass, fail, 0],...|
+------------+--------------------+

I can also drill further down when I name a key explicitly (e.g. account.checks.abc123.check1). Ultimately, I would like to rationalise the three checks into their own rows in the DataFrame, but because the check key changes, I am not sure how to go about it.

I would like the DataFrame to end up in that one-row-per-check shape (I have not expanded result, but I can take that further myself). I don't know the names of the tests in advance (e.g. abc123, xzy7892) and they do change, so perhaps I need to build an array first.

Any ideas?

1 answer:

Answer 0: (score: 0)

If your input DataFrame and its schema are as follows

+-------------------------------------------------------------------------------------------+
|account                                                                                    |
+-------------------------------------------------------------------------------------------+
|[test_account, [[pass, fail, 0], [fail, pass, 0, [1, 2]], [pass, fail, 0, [1, 2]]], abc123]|
+-------------------------------------------------------------------------------------------+

root
 |-- account: struct (nullable = true)
 |    |-- accountID: string (nullable = true)
 |    |-- checks: struct (nullable = true)
 |    |    |-- abc123: struct (nullable = true)
 |    |    |    |-- check1: string (nullable = true)
 |    |    |    |-- check2: string (nullable = true)
 |    |    |    |-- check3: long (nullable = true)
 |    |    |-- foobar11012387: struct (nullable = true)
 |    |    |    |-- check1: string (nullable = true)
 |    |    |    |-- check2: string (nullable = true)
 |    |    |    |-- check3: long (nullable = true)
 |    |    |    |-- result: struct (nullable = true)
 |    |    |    |    |-- item1: long (nullable = true)
 |    |    |    |    |-- item2: long (nullable = true)
 |    |    |-- xzy7892: struct (nullable = true)
 |    |    |    |-- check1: string (nullable = true)
 |    |    |    |-- check2: string (nullable = true)
 |    |    |    |-- check3: long (nullable = true)
 |    |    |    |-- result: struct (nullable = true)
 |    |    |    |    |-- item1: long (nullable = true)
 |    |    |    |    |-- item2: long (nullable = true)
 |    |-- name: string (nullable = true)

then you can use the struct, array and explode functions as follows to get the desired output

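from pyspark.sql import functions as f

checks1 = ['abc123', 'foobar11012387', 'xzy7892']  # per-check struct names
checks2 = ['check1', 'check2', 'check3']           # fields common to every check

# One struct per field in checks2; values are cast to string so the array elements share a type
structs = [f.struct([f.col('account.checks.' + y + '.' + x).cast('string').alias(y) for y in checks1]).alias(x)
           for x in checks2]

# explode yields one row per struct; temp.* expands each struct into columns
df.select(f.col('account.accountID'), f.explode(f.array(*structs)).alias('temp'))\
    .select(f.col('accountID'), f.col('temp.*'))\
    .show(truncate=False)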

which should give you

+------------+------+--------------+-------+
|accountID   |abc123|foobar11012387|xzy7892|
+------------+------+--------------+-------+
|test_account|pass  |fail          |pass   |
|test_account|fail  |pass          |fail   |
|test_account|0     |0             |0      |
+------------+------+--------------+-------+

I hope the answer is helpful.
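The output above has one column per check, whereas the question asks for one row per check. As a rough illustration of that target shape, here is a plain-Python sketch that works on a single raw JSON payload. The function name checks_to_rows and the "check" column name are assumptions of this sketch. It runs on one string in local memory, so it shows the reshaping logic rather than a distributed Spark solution.

```python
import json


def checks_to_rows(payload):
    """Flatten one account payload into one dict per check."""
    account = json.loads(payload)["account"]
    rows = []
    for check_name, check in account["checks"].items():
        row = {"accountID": account["accountID"], "check": check_name}
        # Copy scalar fields only; nested objects such as "result" are skipped here.
        row.update({k: v for k, v in check.items() if not isinstance(v, dict)})
        rows.append(row)
    return rows


# Shortened version of the payload from the question.
sample = """
{"account": {"accountID": "test_account", "name": "abc123", "checks": {
    "abc123": {"check1": "pass", "check2": "fail", "check3": 0},
    "xzy7892": {"check1": "pass", "check2": "fail", "check3": 0,
                "result": {"item1": 1, "item2": 2}},
    "foobar11012387": {"check1": "fail", "check2": "pass", "check3": 0,
                       "result": {"item1": 1, "item2": 2}}}}}
"""

rows = checks_to_rows(sample)
for row in rows:
    print(row)
```

Each row keeps the check name as a value instead of a column name, so new or renamed checks do not change the set of columns.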