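I am new to PySpark and the world of DataFrames. I currently have a JSON file that I need to read, and I need to build a nested JSON from the information in it. I am using SQL to pull individual tables into DataFrames and then joining them to produce the final output. The problem is that I am not sure how to create the nested structure: a new column named Contact, which in turn holds lists for ContactPhone, ContactEmail, and ContactAddress. The code is below.
Input: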
{
"id": "1",
"surname": "xyz",
"name": "abc",
"phonetype": "mobile",
"locationcode": "IND",
"areacitycode": "091",
"phonenumber": "1234567890",
"email": "abc@xyz.com",
"address": "1234 STREET NAME, CITY, COUNTRY, ZIP"
}
Output:
{
"id":"1",
"lastName":"xyz",
"firstName":"abc",
"contact":[{
"contactPhone":[{
"type":"home",
"useType":"phone",
"cityCode":684,
"phone":"68567705",
"text":"",
"locationCode":"IND"
}],
"contactEmail":[{
"emailType":"office",
"emailId":"abc@xyz.com"
}],
"contactAddress":[{
"streetNo":"1234",
"streetName":"STREET NAME",
"city":"city name",
"country":"country",
"zipCode":"zip"
}]
}]
}
import sys
import logging
import json
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import functions as F
if __name__ == "__main__":
    logging.getLogger("py4j").setLevel(logging.DEBUG)
    sc = SparkSession \
        .builder \
        .config(conf=SparkConf()) \
        .appName("pyspark: jsonglean") \
        .getOrCreate()
    inp_path = r"\input.json"
    # Read a multi-line JSON file
    inp_df = sc.read.option("multiline", "true").json(inp_path)
    # print(type(inp_df))
    inp_df.printSchema()
    # Register the DataFrame as a temporary view so it can be queried with SQL
    inp_df.createOrReplaceTempView("customer_table")
    # Customer information only
    user_sql_df = sc.sql("""
        SELECT id AS ID,
               surname AS LASTNAME,
               name AS FIRSTNAME
        FROM customer_table
    """)
    # print("RESPONSE TYPE:", type(user_sql_df))
    print("EXTRACTION OF CUSTOMER DATA")
    user_sql_df.show()
    # Phone information only (column names as they appear in the input file)
    user_phone_sql_df = sc.sql("""
        SELECT id AS ID,
               phonetype, areacitycode, phonenumber, locationcode
        FROM customer_table
    """)
    user_phone_sql_df.show()
    user_phone_sql_df.printSchema()
    nameSchema = StructType([
        StructField("id", StringType(), True),
        StructField("firstName", StringType(), True),
        StructField("lastName", StringType(), True),
        StructField("contactPhone", StringType(), True),
    ])
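I was planning to create a schema, but I am not sure whether that would help, since my input data is not in that format. Any help or guidance would be appreciated. Thanks.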