I have a question about the from_json function in the pyspark.sql.functions module while parsing data from Kafka with PySpark 2.4 Structured Streaming.
from pyspark.sql.functions import *
from pyspark.sql.types import *
schema = StructType([
StructField("id", StringType(), True),
StructField("mobile", StringType(), True),
StructField("email", StringType()),
StructField("created_time", TimestampType(), True),
StructField("created_ip", StringType(), True),
])
data = {
"id": "11111",
"mobile": "18212341234",
"created_time": '2019-01-03 15:40:27',
"created_ip": "11.122.68.106",
}
data_list = [(1, str(data))]
df = spark.createDataFrame(data_list, ("key", "value"))
df.select(from_json("value", schema).alias("json")).collect()
[Row(json=Row(id='11111', mobile='18212341234', email=None, created_time=datetime.datetime(2019, 1, 3, 15, 40, 27), created_ip='11.122.68.106'))]
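Note that str(data) serializes the dict with Python's repr, so the value column actually holds single-quoted keys rather than strict JSON; as far as I can tell, Spark's JSON reader tolerates this because the allowSingleQuotes option defaults to true. This is what the string looks like:

print(str(data))
# {'id': '11111', 'mobile': '18212341234', 'created_time': '2019-01-03 15:40:27', 'created_ip': '11.122.68.106'}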
The code above is correct and works fine. But the code below confuses me.
from pyspark.sql.functions import *
from pyspark.sql.types import *
schema = StructType([
StructField("id", StringType(), True),
StructField("mobile", StringType(), True),
StructField("email", StringType()),
StructField("created_time", TimestampType(), True),
StructField("created_ip", StringType(), True),
])
data = {
"id": "11111",
"mobile": "18212341234",
"email": None,
"created_time": '2019-01-03 15:40:27',
"created_ip": "11.122.68.106",
}
data_list = [(1, str(data))]
df = spark.createDataFrame(data_list, ("key", "value"))
df.select(from_json("value", schema).alias("json")).collect()
[Row(json=None)]
All I did was add "email": None to the data dict, and the from_json function could no longer parse the data into a DataFrame. Since I read this data directly from Kafka, I do not know how to handle it up front. Should I first remove the None values from the data, or is there some other function that can parse the data correctly?
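My own guess is that the problem is in my test serialization rather than in from_json itself: str(data) renders the missing value as Python's None, which is not a valid JSON literal (JSON uses null). Here is a minimal sketch of what I mean; the json import and the re-test are my additions, assuming the same spark session and schema as above:

import json

# str() emits Python literals: None is not valid JSON
print(str(data))
# {'id': '11111', 'mobile': '18212341234', 'email': None, 'created_time': '2019-01-03 15:40:27', 'created_ip': '11.122.68.106'}

# json.dumps() emits strict JSON: None becomes null
print(json.dumps(data))
# {"id": "11111", "mobile": "18212341234", "email": null, "created_time": "2019-01-03 15:40:27", "created_ip": "11.122.68.106"}

# Re-running the same parse on strict JSON succeeds, and email comes back as None
df = spark.createDataFrame([(1, json.dumps(data))], ("key", "value"))
df.select(from_json("value", schema).alias("json")).collect()
# [Row(json=Row(id='11111', mobile='18212341234', email=None, created_time=datetime.datetime(2019, 1, 3, 15, 40, 27), created_ip='11.122.68.106'))]

But my real data comes from Kafka, so I cannot simply change how it was serialized. Does this mean I have to clean the None tokens out of each message before calling from_json?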
Can you help me? Thanks a lot.