How to correctly parse a Kafka stream with PySpark Structured Streaming and remove None values from the data

Date: 2019-01-04 06:09:55

Tags: pyspark streaming

While parsing data from Kafka with PySpark 2.4 Structured Streaming, I ran into a question about the from_json function in pyspark.sql.functions.

from pyspark.sql.functions import *
from pyspark.sql.types import *

schema = StructType([
    StructField("id", StringType(), True),
    StructField("mobile", StringType(), True),
    StructField("email", StringType()),
    StructField("created_time", TimestampType(), True),
    StructField("created_ip", StringType(), True),
])

data = {
    "id": "11111",
    "mobile": "18212341234",
    "created_time": '2019-01-03 15:40:27',
    "created_ip": "11.122.68.106",
}

# str(data) produces the dict's Python repr, not strict JSON
data_list = [(1, str(data))]
df = spark.createDataFrame(data_list, ("key", "value"))  # `spark` is the pyspark shell's SparkSession
df.select(from_json("value", schema).alias("json")).collect()

[Row(json=Row(id='11111', mobile='18212341234', email=None, created_time=datetime.datetime(2019, 1, 3, 15, 40, 27), created_ip='11.122.68.106'))]
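As far as I understand, the single-quoted string produced by str(data) parses here because Spark's JSON reader enables the allowSingleQuotes option by default. A minimal check making that option explicit (in Spark 2.4, from_json accepts an options dict as its third argument):

# allowSingleQuotes defaults to true; passing it explicitly only documents the behavior
df.select(from_json("value", schema, {"allowSingleQuotes": "true"}).alias("json")).collect()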

That first snippet is correct and works as expected, but the code below confuses me.

from pyspark.sql.functions import *
from pyspark.sql.types import *

schema = StructType([
    StructField("id", StringType(), True),
    StructField("mobile", StringType(), True),
    StructField("email", StringType()),
    StructField("created_time", TimestampType(), True),
    StructField("created_ip", StringType(), True),
])

data = {
    "id": "11111",
    "mobile": "18212341234",
    "email": None,
    "created_time": '2019-01-03 15:40:27',
    "created_ip": "11.122.68.106",
}

data_list = [(1, str(data))]
df = spark.createDataFrame(data_list, ("key", "value"))
df.select(from_json("value", schema).alias("json")).collect()

[Row(json=None)]

All I did was add "email": None to the data dict, and from_json can no longer parse the value into a DataFrame. Since I read this data directly from Kafka, I don't know how to pre-process it. Should I strip the None values from the data first, or is there some other function that can parse it correctly?
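For reference, here is a minimal sketch of the difference between the two serializations (json is the Python standard library, not part of my pipeline). str(data) emits the dict's repr, where the missing value appears as the literal None, while json.dumps(data) emits valid JSON with null:

import json

data = {
    "id": "11111",
    "mobile": "18212341234",
    "email": None,
    "created_time": "2019-01-03 15:40:27",
    "created_ip": "11.122.68.106",
}

# Python repr: single quotes and the literal None -- not valid JSON
print(str(data))
# {'id': '11111', 'mobile': '18212341234', 'email': None, ...}

# Standard-library JSON encoding: double quotes and null
print(json.dumps(data))
# {"id": "11111", "mobile": "18212341234", "email": null, ...}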

Can you help me? Thanks a lot.

0 Answers:

There are no answers yet.