Using pyspark and Structured Streaming to correctly parse a Kafka stream (currently getting all nulls with the current schema)

Time: 2018-05-04 20:49:57

Tags: python apache-spark pyspark apache-kafka

I am using Spark 2.3.0 with pyspark to subscribe to a Kafka stream. I am trying to parse the message values, but every field of every record comes back null.

My Kafka version is kafka_2.11-1.1.0, broker version 0.10.

I run the script with: /opt/spark/bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 ~/code/process/mta_processor.py

mta_processor.py looks like this:

import pyspark 
from pyspark import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import bson

sc = SparkContext()
sc.setLogLevel("ERROR")
spark = SparkSession(sc)


df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers","localhost:9092") \
    .option("subscribe", "mta-delays") \
    .option("startingOffsets", "earliest").load()

jsonschema = StructType().add("timestamp", StringType()) \
                         .add("timestamp_unix", IntegerType()) \
                         .add("oid", StringType()) \
                         .add("lines", ArrayType(StructType() \
                             .add("line", StringType()) \
                            .add("status", StringType()) \
                            .add("raw_text", StringType())))

mta_stream = df.select(from_json(col("value") \
                                .cast("string"), jsonschema) \
                                .alias("parsed_mta_values"))

mta_data = mta_stream.select("parsed_mta_values.*")


qry = mta_data.writeStream.outputMode("append").format("console").start()
qry.awaitTermination()

But the result is all nulls:

Batch: 0
-------------------------------------------
+---------+--------------+----+-----+
|timestamp|timestamp_unix| oid|lines|
+---------+--------------+----+-----+
|     null|          null|null| null|
|     null|          null|null| null|
|     null|          null|null| null|
|     null|          null|null| null|
+---------+--------------+----+-----+

If I instead just grab the messages with mta_data = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"), the data is there, since I get the contents of the key and value:

-------------------------------------------
Batch: 0
-------------------------------------------
+--------------------+--------------------+
|                 key|               value|
+--------------------+--------------------+
|5aecc1faeb0155502...|"{\"timestamp_uni...|
|5aecc254eb0155512...|"{\"timestamp_uni...|
|5aecc2b0eb0155545...|"{\"timestamp_uni...|
+--------------------+--------------------+

The data I send to my Kafka topic from the Kafka producer looks like this:

{ "timestamp_unix":1525465800, "lines":[ { "status":"GOOD SERVICE", "line":"123", "raw_text":null }, { "status":"GOOD SERVICE", "line":"456", "raw_text":null }, { "status":"GOOD SERVICE", "line":"7", "raw_text":null }, { "status":"PLANNED WORK", "line":"ACE", "raw_text":"\n <span class=\"TitlePlannedWork\" >Planned Work</span>\n <br/>\n <a class=\"plannedWorkDetailLink\" onclick=ShowHide(184938);><b>Rockaways Long-Term Flood Protection | Until May 18, Mon to Fri, 6 AM to 10 AM & 3:30 PM to 8 PM<br clear=left>[A] No service <i>to/from</i> Rockaway Park-Beach 116 St</a><br/><br/><div id= 184938 class=\"plannedWorkDetail\" ><br></b>[A] trains that were scheduled to operate <i>to/from</i> Rockaway Park-Beach 116 St will operate <i>to/from</i> Euclid Av instead.<br><br>[S] No Rockaway Park Shuttle service at Broad Channel.<br><br>[S] Rockaway Park Shuttle operates, <i>approximately every 20 minutes,</i> between <b>Rockaway Park-Beach 116 St</b> and <b>Beach 90 St</b> and via the [A] <i>to/from</i> <b>Far Rockaway-Mott Av</b>.<br><br>For<b> Beach 90 St, Beach 98 St, Beach 105 St</b> and <b>Rockaway Park-Beach 116 St</b>, transfer to the [S] Rockaway Park Shuttle at Beach 67 St.<br><br><a href=http://www.mta.info/press-release/nyc-transit/nyc-transit-starting-subway-flood-protection-project-rockaways-next-month target=_blank><font color=#0000FF>Click here</font></a> for additional details on this Flood Protection Project.<br><br><table class=plannedworkTableStyle border=1 cellspacing=1 cellpadding=5 rules=none frame=box><td> [ad] <td><font size=1>This service change affects one or more ADA accessible stations. Please call 511 for help with planning<br>your trip. If you are deaf or hard of hearing, use your preferred relay service provider or the free 711 relay. 
</font></table><br><b><br></div></b><br/>\n <br/><br/>\n " }, { "status":"DELAYS", "line":"BDFM", "raw_text":"\n <span class=\"TitleDelay\">Delays</span>\n <span class=\"DateStyle\">\n &nbsp;Posted:&nbsp;05/04/2018&nbsp; 4:29PM\n </span><br/><br/>\n [F] and [M] train service has resumed following an earlier incident involving a sick passenger at <STRONG>34 St-Herald Sq.</STRONG>\n <br/><br/>\n " }, { "status":"DELAYS", "line":"G", "raw_text":"\n <span class=\"TitleDelay\">Delays</span>\n <span class=\"DateStyle\">\n &nbsp;Posted:&nbsp;05/04/2018&nbsp; 4:01PM\n </span><br/><br/>\n Court Sq-bound [G] trains are running with delays because of signal problems at<STRONG> Broadway</STRONG>.\n <br/><br/>\n " }, { "status":"GOOD SERVICE", "line":"JZ", "raw_text":null }, { "status":"DELAYS", "line":"L", "raw_text":"\n <span class=\"TitleDelay\">Delays</span>\n <span class=\"DateStyle\">\n &nbsp;Posted:&nbsp;05/04/2018&nbsp; 4:27PM\n </span><br/><br/>\n [L] trains are running with delays in both directions because of a sick passenger at <STRONG>Canarsie-Rockaway Pkwy.</STRONG>\n <br/><br/>\n " }, { "status":"GOOD SERVICE", "line":"NQR", "raw_text":null }, { "status":"PLANNED WORK", "line":"S", "raw_text":"\n <span class=\"TitlePlannedWork\" >Planned Work</span>\n <br/>\n <a class=\"plannedWorkDetailLink\" onclick=ShowHide(184937);><b>Rockaways Long-Term Flood Protection | Until Friday May 18, 2018<br clear=left>[S] No Rockaway Park Shuttle service at Broad Channel - Take the [A] instead<br clear=left>[A] No rush hour service <i>to/from</i> Rockaway Park-Beach 116 St</a><br/><br/><div id= 184937 class=\"plannedWorkDetail\" ><br></b>[S] Rockaway Park Shuttle operates, <i>approximately every 20 minutes,</i> between <br><b>Rockaway Park-Beach 116 St</b> and <b>Beach 90 St</b> and via the [A] <i>to/from</i> <b>Far Rockaway-Mott Av</b>.<br><br>For <b>Broad Channel</b>, take the [A], transfer to the [S] Rockaway Park Shuttle at Beach 67 St.<br><br><b><i>Alternate travel note for 
Broad Channel:<br></i>Q52 </b>SBS, <b>Q53 </b>SBS, <b>QM16</b> and <b>QM17</b> service is also available at Cross Bay Blvd and Noel Rd.<br><br><a href=http://www.mta.info/press-release/nyc-transit/nyc-transit-starting-subway-flood-protection-project-rockaways-next-month target=_blank><font color=#0000FF>Click here</font></a> for additional details on this Flood Protection Project.<br><b><br></div></b><br/>\n <br/><br/>\n " }, { "status":"GOOD SERVICE", "line":"SIR", "raw_text":null } ], "timestamp":"5/4/2018 4:30:00 PM", "oid":"5aecc363eb015557829c87c5" }

I don't see any obvious error messages or problems on the consumer side, so I can't tell why all my values come out null.

Is there an easier way to figure out why the values are not being parsed correctly?
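One way to narrow this down outside Spark is to decode a single raw message value in plain Python and look at the type that comes back. This is a minimal sketch that assumes the value is UTF-8 encoded JSON; `top_level_json_type` is a hypothetical helper, not part of any library. The console output above shows values starting with `"{\"timestamp_uni`, which looks like the second case below.

```python
import json

def top_level_json_type(raw_value):
    """Decode one message value once and report the Python type of the result.

    Hypothetical helper for illustration; accepts bytes or str.
    """
    if isinstance(raw_value, bytes):
        raw_value = raw_value.decode("utf-8")
    return type(json.loads(raw_value)).__name__

# A JSON object decodes to a dict, which a StructType schema can match.
plain = '{"timestamp_unix": 1525465800, "oid": "5aecc363eb015557829c87c5"}'

# A JSON string that itself contains JSON decodes to a str,
# which a StructType schema cannot match.
wrapped = json.dumps(plain)

print(top_level_json_type(plain))
print(top_level_json_type(wrapped))
```

If the helper reports `str` for a real message, the payload was probably JSON-encoded more than once before it reached the topic.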

Update

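It looks like the quoting may be part of the problem. For what it's worth, I have a Python dictionary object pulled from a MongoDB collection; each record is dumped to a string with prepared_record = json.dumps(record) and then sent: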

producer = KafkaProducer(bootstrap_servers='localhost:9092', 
                     value_serializer=lambda v: json.dumps(v).encode('utf-8'))

producer.send(MTA_DELAYS_IN_KAFKA_TOPIC, key=obj_key.encode(), value=prepared_record).get(timeout=30)

I'm not sure whether there is a better way to prepare and send the records.
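Calling `json.dumps(record)` first and then passing the result through a `value_serializer` that also calls `json.dumps` would encode the record twice, which would explain the escaped quotes. The sketch below reproduces this in plain Python without Kafka; `record` is made-up sample data and `value_serializer` is a named copy of the lambda above.

```python
import json

# Made-up sample record standing in for one MongoDB document.
record = {
    "timestamp_unix": 1525465800,
    "lines": [{"line": "123", "status": "GOOD SERVICE", "raw_text": None}],
}

def value_serializer(v):
    # Same logic as the lambda passed to KafkaProducer above.
    return json.dumps(v).encode("utf-8")

# What the producer code above does: dump to a string, then serialize again.
prepared_record = json.dumps(record)
double_encoded = value_serializer(prepared_record)

# Serializing the dict only once keeps the payload a JSON object.
single_encoded = value_serializer(record)

print(double_encoded[:24])
print(single_encoded[:24])

# One decode of the double-encoded payload yields a string, not a dict.
print(type(json.loads(double_encoded.decode("utf-8"))).__name__)
print(type(json.loads(single_encoded.decode("utf-8"))).__name__)
```

Under that assumption, either passing `record` itself as `value=` and letting the serializer do the only encoding, or dropping the serializer and sending `prepared_record.encode('utf-8')`, should leave a top-level JSON object on the topic.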

2 answers:

Answer 0: (score: 0)

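I had the same problem. Using StringType() for every field is not correct. If even one of your data types is wrong, all fields come back null. In my case, timestamp_unix had to be LongType() rather than IntegerType().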

This should solve your problem.
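Spark's IntegerType is a 32-bit signed integer and LongType is a 64-bit signed integer, so whether the switch matters depends on the size of the values. A rough check in plain Python, where `fits_integer_type` is a hypothetical helper and the millisecond value is an invented example:

```python
# Bounds of a 32-bit signed integer, the range of Spark's IntegerType.
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

def fits_integer_type(value):
    """Return True if value fits in a 32-bit signed integer (hypothetical helper)."""
    return INT32_MIN <= value <= INT32_MAX

# Epoch seconds taken from the sample message in the question.
print(fits_integer_type(1525465800))

# The same instant in epoch milliseconds (invented example) needs LongType.
print(fits_integer_type(1525465800000))
```

Timestamps in seconds such as the one in the question still fit in IntegerType, while millisecond timestamps do not, so LongType is the safer choice when the unit is uncertain.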

Answer 1: (score: -1)

Try using StringType() for all columns in jsonschema. For me, it worked once I changed them to StringType():

jsonschema = StructType().add("timestamp", StringType()) \
                     .add("timestamp_unix", StringType()) \
                     .add("oid", StringType()) \
                     .add("lines", ArrayType(StructType() \
                         .add("line", StringType()) \
                        .add("status", StringType()) \
                        .add("raw_text", StringType())))