Parsing a list of JSON strings with a PySpark DataFrame

Date: 2020-09-09 16:47:33

Tags: python json dataframe apache-spark pyspark

I am trying to read a list of JSON strings into a PySpark DataFrame. My input data is below; the goal is a DataFrame with two columns: user (string) and ips (Array[String]).

sampleJson = [
    ('{"user":100, "ips" : ["191.168.192.101", "191.168.192.103", "191.168.192.96", "191.168.192.99"]}',),
    ('{"user":101, "ips" : ["191.168.192.102", "191.168.192.105", "191.168.192.103", "191.168.192.107"]}',),
    ('{"user":102, "ips" : ["191.168.192.105", "191.168.192.101", "191.168.192.105", "191.168.192.107"]}',),
    ('{"user":103, "ips" : ["191.168.192.96", "191.168.192.100", "191.168.192.107", "191.168.192.101"]}',),
    ('{"user":104, "ips" : ["191.168.192.99", "191.168.192.99", "191.168.192.102", "191.168.192.99"]}',),
    ('{"user":105, "ips" : ["191.168.192.99", "191.168.192.99", "191.168.192.100", "191.168.192.96"]}',),
]
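To make the target shape concrete, the first record can be parsed with the standard-library json module. This is a plain-Python sketch of the desired user/ips pairing only, not the Spark solution itself:

```python
import json

# First sample record (same payload as sampleJson[0] above)
raw = '{"user":100, "ips" : ["191.168.192.101", "191.168.192.103", "191.168.192.96", "191.168.192.99"]}'

parsed = json.loads(raw)
user = str(parsed["user"])  # desired column: user, as a string
ips = parsed["ips"]         # desired column: ips, an array of strings

print(user, ips)
```

Each input tuple should therefore become one row with a scalar `user` and a four-element `ips` array.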

Thanks for your help.

1 Answer:

Answer 0 (score: 0)

Use the from_json function with a defined schema.

Example:

from pyspark.sql.functions import *
from pyspark.sql.types import *

sampleJson = [
    ('{"user":100, "ips" : ["191.168.192.101", "191.168.192.103", "191.168.192.96", "191.168.192.99"]}',),
    ('{"user":101, "ips" : ["191.168.192.102", "191.168.192.105", "191.168.192.103", "191.168.192.107"]}',),
    ('{"user":102, "ips" : ["191.168.192.105", "191.168.192.101", "191.168.192.105", "191.168.192.107"]}',),
    ('{"user":103, "ips" : ["191.168.192.96", "191.168.192.100", "191.168.192.107", "191.168.192.101"]}',),
    ('{"user":104, "ips" : ["191.168.192.99", "191.168.192.99", "191.168.192.102", "191.168.192.99"]}',),
    ('{"user":105, "ips" : ["191.168.192.99", "191.168.192.99", "191.168.192.100", "191.168.192.96"]}',),
]

# The single-element tuples produce a one-column DataFrame; Spark names the column "_1"
df1 = spark.createDataFrame(sampleJson)

# Schema for the JSON payload: user (string) and ips (array of strings)
sch = StructType([
    StructField("user", StringType(), False),
    StructField("ips", ArrayType(StringType())),
])

df1.withColumn("n", from_json(col("_1"), sch)).select("n.*").show(10, False)
#+----+--------------------------------------------------------------------+
#|user|ips                                                                 |
#+----+--------------------------------------------------------------------+
#|100 |[191.168.192.101, 191.168.192.103, 191.168.192.96, 191.168.192.99]  |
#|101 |[191.168.192.102, 191.168.192.105, 191.168.192.103, 191.168.192.107]|
#|102 |[191.168.192.105, 191.168.192.101, 191.168.192.105, 191.168.192.107]|
#|103 |[191.168.192.96, 191.168.192.100, 191.168.192.107, 191.168.192.101] |
#|104 |[191.168.192.99, 191.168.192.99, 191.168.192.102, 191.168.192.99]   |
#|105 |[191.168.192.99, 191.168.192.99, 191.168.192.100, 191.168.192.96]   |
#+----+--------------------------------------------------------------------+


# resulting schema

df1.withColumn("n", from_json(col("_1"), sch)).select("n.*").printSchema()
#root
# |-- user: string (nullable = true)
# |-- ips: array (nullable = true)
# |    |-- element: string (containsNull = true)