How can I save structured streaming data read from Kafka into a DataFrame and parse it?

Time: 2019-04-03 04:24:48

Tags: apache-spark-sql spark-structured-streaming

I am trying to read live streaming data from Kafka topics with Spark Structured Streaming. My understanding is that I would need to stop the stream at some point so that I can apply my parsing logic to it and push the result to MongoDB. Is there a way to save the streaming data into a separate DataFrame, with or without stopping the stream?

I checked the guides and other blogs, but did not find a direct answer to my question.

val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:9092, host:9092, host:9092")
.option("subscribe", "TOPIC_P2_R2, TOPIC_WITH_COMP_P2_R2.DIT, TOPIC_WITHOUT_COMP_P2_R2.DIT")
.option("startingOffsets", "earliest")
.load()

val dfs = df.selectExpr("CAST(value AS STRING)")

val consoleOutput = dfs.writeStream
.outputMode("append")
.format("console")
.start()
consoleOutput.awaitTermination()
consoleOutput.stop()

I need to somehow save the streaming data into a DataFrame, either by stopping the stream or without stopping it.
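For reference, one common pattern that avoids stopping the query at all (not part of the question's code, and assuming Spark 2.4 or later, where `foreachBatch` is available) is to let Structured Streaming hand each micro-batch to your code as an ordinary static DataFrame. A minimal sketch, reusing the `dfs` DataFrame from above:

```scala
// Sketch only: assumes Spark 2.4+ and the streaming DataFrame `dfs` defined above.
import org.apache.spark.sql.DataFrame

val query = dfs.writeStream
  .outputMode("append")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // batchDF is a plain, non-streaming DataFrame holding this micro-batch,
    // so batch-style parsing logic (or a MongoDB write) can be applied here
    // while the streaming query keeps running.
    batchDF.createOrReplaceTempView("batch_logs")
    batchDF.sparkSession.sql("select value from batch_logs").show(false)
  }
  .start()

query.awaitTermination()
```

Inside `foreachBatch` the usual batch APIs (temp views, `spark.sql`, batch connectors) are all available, which is why it is often suggested for sinks like MongoDB that have no native streaming writer.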

Below is the parsing logic I have. Instead of loading the dataset from a file path, I need the streaming data to be my new dataset; I should then be able to apply the rest of my logic to it and get the output. Saving it to Mongo is not my main focus right now.

val log = spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("C:\\Users\\raheem_mohammed\\IdeaProjects\\diag.csv")
log.createOrReplaceTempView("logs")
val df = spark.sql("select _raw, _time from logs").toDF

import org.apache.spark.sql.functions._
import spark.implicits._

//Adds an Id number to each event
val logs = df.withColumn("Id", monotonically_increasing_id() + 1)
//Register the DataFrame as a temp table
logs.createOrReplaceTempView("logs")

val dfss = spark.sql("select Id, _raw from logs")


//Extracts columns from the _raw column. Also adds boolean flags indicating
//which of the four composite-name variants, if any, each record contains.
val extractedDF = dfss.withColumn("managed_server", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\]",2))
  .withColumn("alert_summary", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]",3))
  .withColumn("oracle_details", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]",5))
  .withColumn("ecid", regexp_extract($"_raw", "(?<=ecid: )(.*?)(?=,)",1))
  //.withColumn("CompName",regexp_extract($"_raw",""".*(composite_name|compositename|composites|componentDN):\s+(\S+)\]""",2))
  .withColumn("CompName",regexp_extract($"_raw",""".*(composite_name|compositename|composites|componentDN):\s+([a-zA-Z]+)""",2))
  .withColumn("composite_name", col("_raw").contains("composite_name"))
  .withColumn("compositename", col("_raw").contains("compositename"))
  .withColumn("composites", col("_raw").contains("composites"))
  .withColumn("componentDN", col("_raw").contains("componentDN"))

//Filters out any NULL values if found
val finalData = extractedDF.filter(
  col("managed_server").isNotNull &&
    col("alert_summary").isNotNull &&
    col("oracle_details").isNotNull &&
    col("ecid").isNotNull &&
    col("CompName").isNotNull &&
    col("composite_name").isNotNull &&
    col("compositename").isNotNull &&
    col("composites").isNotNull &&
    col("componentDN").isNotNull)

finalData.show(false)
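For what it's worth, the extractions and the filter above are all stateless, row-by-row operations, which Structured Streaming supports directly on a streaming DataFrame, so the same logic could in principle be applied to the Kafka stream without an intermediate file. A hedged sketch (column names taken from the question; note that `regexp_extract` returns an empty string, not NULL, when nothing matches, so an emptiness check is used instead of `isNotNull`):

```scala
// Sketch: the same regexp_extract/filter style applied directly to the
// streaming DataFrame `dfs` (Kafka value already cast to string above).
import org.apache.spark.sql.functions._

val parsedStream = dfs
  .withColumn("ecid", regexp_extract(col("value"), "(?<=ecid: )(.*?)(?=,)", 1))
  .withColumn("CompName",
    regexp_extract(col("value"),
      """.*(composite_name|compositename|composites|componentDN):\s+([a-zA-Z]+)""", 2))
  .filter(col("ecid") =!= "")   // unmatched rows yield "", not NULL

val sink = parsedStream.writeStream
  .outputMode("append")
  .format("console")
  .start()

sink.awaitTermination()
```

Since the transformations stay on the streaming DataFrame, no stopping of the query is needed; only actions that require a complete dataset (like `show` on a static frame) would need `foreachBatch` or a sink.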

0 Answers