My output is empty when I print a DataFrame in streaming mode

Asked: 2017-05-29 12:40:27

Tags: file spark-streaming spark-dataframe

I have a folder that gets filled with different txt files in a streaming fashion. I wrote some code that extracts IP information from them and puts it into a DataFrame (it works fine in non-streaming mode). The problem is that when I run the code, all of the output is empty!

[screenshot of the empty output]

Here is my code:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

import scala.util.Try
/**
  * Created by saeedtkh on 5/24/17.
  */
object Main_ML_without_Streaming {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("saeed_test").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))
    conf.set("spark.driver.allowMultipleContexts", "true")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    /////////////////////Start extract the packet
    val customSchema = StructType(Array(
      StructField("column0", StringType, true),
      StructField("column1", StringType, true),
      StructField("column2", StringType, true)))

    val DStream = ssc.textFileStream("/Users/saeedtkh/Desktop/sharedsaeed/Test")
    val Row_DStream = DStream.map(line => line.split(">")).map(array => {
      val first = Try(array(0).trim.split(" ")(0)) getOrElse ""
      val second = Try(array(1).trim.split(" ")(6)) getOrElse ""
      val third = Try(array(2).trim.split(" ")(0).replace(":", "")) getOrElse ""
      Row.fromSeq(Seq(first, second, third))
    })

    Row_DStream.foreachRDD { DStream =>
      val dataFrame_trainingData = sqlContext.createDataFrame(DStream, customSchema)
      dataFrame_trainingData.groupBy("column1", "column2").count().show()
      /////////////////////end extract the packet


      val numFeatures = 3
      val model = new StreamingLinearRegressionWithSGD()
        .setInitialWeights(Vectors.zeros(numFeatures))
      /////////////////////////////////////////////////////Model made
    }

    ssc.start()
    ssc.awaitTermination()


    print("Here is the anwser: *****########*********#########*******222")
  }
}

Here is the content of the files:

07:30:42.415558 IP 10.0.0.3.53890 > 10.0.0.1.5001: Flags [.], seq 163564797:163629957, ack 1082992383, win 58, options [nop,nop,TS val 9762853 ecr 9762853], length 65160
07:30:42.415558 IP 10.0.0.3.53890 > 10.0.0.1.5001: Flags [.], seq 65160:130320, ack 1, win 58, options [nop,nop,TS val 9762853 ecr 9762853], length 65160
07:30:42.415558 IP 10.0.0.3.53890 > 10.0.0.1.5001: Flags [.], seq 130320:178104, ack 1, win 58, options [nop,nop,TS val 9762853 ecr 9762853], length 47784
07:30:42.415660 IP 10.0.0.1.5001 > 10.0.0.3.53890: Flags [.], ack 178104, win 1, options [nop,nop,TS val 9762853 ecr 9762853], length 0
07:30:42.415708 IP 10.0.0.1.5001 > 10.0.0.3.53890: Flags [.], ack 178104, win 1051, options [nop,nop,TS val 9762853 ecr 9762853], length 0
07:30:42.415715 IP 10.0.0.3.53890 > 10.0.0.1.5001: Flags [.], seq 178104:195480, ack 1, win 58, options [nop,nop,TS val 9762853 ecr 9762853], length 17376
07:30:42.415716 IP 10.0.0.3.53890 > 10.0.0.1.5001: Flags [.], seq 195480:260640, ack 1, win 58, options [nop,nop,TS val 9762853 ecr 9762853], length 65160
07:30:42.415716 IP 10.0.0.3.53890 > 10.0.0.1.5001: Flags [.], seq 260640:325800, ack 1, win 58, options [nop,nop,TS val 9762853 ecr 9762853], length 65160

UPDATE1: Following answer 1, I get some errors: [screenshot of the errors] I should also mention that I add a new text file to the folder for each processing run, so there are no old files in it.

Can you tell me where the problem is?

1 Answer:

Answer 0 (score: 0)

After testing, I can confirm the suspicion I hinted at in the comments: the problem is having multiple Spark contexts in the same JVM (see below).

The correct way to initialize Spark, assuming a Spark version > 2.0 (2.1.0 in this case), is to use the SparkSession builder, like this:

import org.apache.spark.sql.SparkSession

val session = SparkSession.builder
  .master("local[*]")
  .appName("maasg_test")
  .getOrCreate()

We can then create the StreamingContext using the SparkContext from that session, like this:

val ssc = new StreamingContext(session.sparkContext, Seconds(5))

The program then runs normally.
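For completeness, here is a minimal sketch of how the whole program from the question could be rewired around a single context. The schema, path and parsing logic are taken from the question; the object name and the exact structure are my assumption about what was intended:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.util.Try

object Main_ML_Streaming_Fixed {
  def main(args: Array[String]): Unit = {
    // Single entry point: the SparkSession owns the only SparkContext in this JVM.
    val session = SparkSession.builder
      .master("local[*]")
      .appName("saeed_test")
      .getOrCreate()

    // The StreamingContext reuses that SparkContext instead of creating its own.
    val ssc = new StreamingContext(session.sparkContext, Seconds(5))

    val customSchema = StructType(Array(
      StructField("column0", StringType, true),
      StructField("column1", StringType, true),
      StructField("column2", StringType, true)))

    val rowStream = ssc.textFileStream("/Users/saeedtkh/Desktop/sharedsaeed/Test")
      .map(_.split(">"))
      .map { array =>
        val first  = Try(array(0).trim.split(" ")(0)).getOrElse("")
        val second = Try(array(1).trim.split(" ")(6)).getOrElse("")
        val third  = Try(array(2).trim.split(" ")(0).replace(":", "")).getOrElse("")
        Row.fromSeq(Seq(first, second, third))
      }

    rowStream.foreachRDD { rdd =>
      // The DataFrame is built on the same SparkContext that feeds the stream,
      // so show() prints the grouped counts instead of an empty result.
      val df = session.createDataFrame(rdd, customSchema)
      df.groupBy("column1", "column2").count().show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}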

With the configuration provided in the question, two SparkContexts are initialized behind the scenes. One is created implicitly by this call:

val ssc = new StreamingContext(conf, Seconds(5))

and a second one when we create it explicitly:

val sc = new SparkContext(conf)

As a consequence, the SQLContext created from the second SparkContext cannot see the data produced by the StreamingContext, which is attached to the first, unrelated SparkContext.

In the logs, this shows up as a trace like this:

[org.apache.spark.executor.Executor] Exception in task 4.0 in stage 46.0 (TID 209). Stacktrace:
java.util.NoSuchElementException: None.get
    at scala.None$.get(Option.scala:313)
    at scala.None$.get(Option.scala:311)
    at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
    at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:670)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:289)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748) 
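A related pattern, shown in the Spark Streaming programming guide, is to obtain the SparkSession inside foreachRDD from the RDD's own SparkContext, so the SQL layer can never end up on a different context. A minimal sketch, reusing the rowStream and customSchema names from the sketch above:

rowStream.foreachRDD { rdd =>
  // getOrCreate() returns the session bound to the RDD's SparkContext (or creates one),
  // guaranteeing the DataFrame is built on the same context as the stream.
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  val df = spark.createDataFrame(rdd, customSchema)
  df.groupBy("column1", "column2").count().show()
}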

One final remark: the configuration setting spark.driver.allowMultipleContexts should only be set to true in specific cases, typically to run separate jobs in parallel.
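For the rare case where two contexts really are needed, a minimal sketch of how the flag would be set (the app name is a placeholder of mine):

import org.apache.spark.SparkConf

// Assumption: two contexts are genuinely required, e.g. for isolated parallel jobs.
// The flag must be on the conf before the SparkContext that needs it is constructed.
val conf = new SparkConf()
  .setAppName("multi_context_example") // hypothetical name
  .setMaster("local[*]")
  .set("spark.driver.allowMultipleContexts", "true")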