I have a folder that is being filled with different txt files in a streaming fashion. I wrote some code that extracts some IP information from them and puts it into a DataFrame. (It works fine when I use it in non-streaming mode.) The problem is that when I run my code, all the output is empty!
[screenshot: empty output]
Here is my code:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try
/**
* Created by saeedtkh on 5/24/17.
*/
object Main_ML_without_Streaming {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("saeed_test").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))
    conf.set("spark.driver.allowMultipleContexts", "true")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    ///////////////////// Start extracting the packet fields
    val customSchema = StructType(Array(
      StructField("column0", StringType, true),
      StructField("column1", StringType, true),
      StructField("column2", StringType, true)))

    val DStream = ssc.textFileStream("/Users/saeedtkh/Desktop/sharedsaeed/Test")
    val Row_DStream = DStream.map(line => line.split(">")).map(array => {
      val first = Try(array(0).trim.split(" ")(0)) getOrElse ""
      val second = Try(array(1).trim.split(" ")(6)) getOrElse ""
      val third = Try(array(2).trim.split(" ")(0).replace(":", "")) getOrElse ""
      Row.fromSeq(Seq(first, second, third))
    })

    Row_DStream.foreachRDD { rdd =>
      val dataFrame_trainingData = sqlContext.createDataFrame(rdd, customSchema)
      dataFrame_trainingData.groupBy("column1", "column2").count().show()
      ///////////////////// End extracting the packet fields

      val numFeatures = 3
      val model = new StreamingLinearRegressionWithSGD()
        .setInitialWeights(Vectors.zeros(numFeatures))
      ///////////////////////////////////////////////////// Model made
    }

    ssc.start()
    ssc.awaitTermination()
    print("Here is the answer: *****########*********#########*******222")
  }
}
Here is the file content:
07:30:42.415558 IP 10.0.0.3.53890 > 10.0.0.1.5001: Flags [.], seq 163564797:163629957, ack 1082992383, win 58, options [nop,nop,TS val 9762853 ecr 9762853], length 65160
07:30:42.415558 IP 10.0.0.3.53890 > 10.0.0.1.5001: Flags [.], seq 65160:130320, ack 1, win 58, options [nop,nop,TS val 9762853 ecr 9762853], length 65160
07:30:42.415558 IP 10.0.0.3.53890 > 10.0.0.1.5001: Flags [.], seq 130320:178104, ack 1, win 58, options [nop,nop,TS val 9762853 ecr 9762853], length 47784
07:30:42.415660 IP 10.0.0.1.5001 > 10.0.0.3.53890: Flags [.], ack 178104, win 1, options [nop,nop,TS val 9762853 ecr 9762853], length 0
07:30:42.415708 IP 10.0.0.1.5001 > 10.0.0.3.53890: Flags [.], ack 178104, win 1051, options [nop,nop,TS val 9762853 ecr 9762853], length 0
07:30:42.415715 IP 10.0.0.3.53890 > 10.0.0.1.5001: Flags [.], seq 178104:195480, ack 1, win 58, options [nop,nop,TS val 9762853 ecr 9762853], length 17376
07:30:42.415716 IP 10.0.0.3.53890 > 10.0.0.1.5001: Flags [.], seq 195480:260640, ack 1, win 58, options [nop,nop,TS val 9762853 ecr 9762853], length 65160
07:30:42.415716 IP 10.0.0.3.53890 > 10.0.0.1.5001: Flags [.], seq 260640:325800, ack 1, win 58, options [nop,nop,TS val 9762853 ecr 9762853], length 65160
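For reference, a throwaway sketch (plain Scala, run outside Spark) of what the split logic in the code above pulls out of the first sample line; the field indices are copied verbatim from the map function, and the object name ParseCheck is made up for illustration:

import scala.util.Try

object ParseCheck {
  def main(args: Array[String]): Unit = {
    val line = "07:30:42.415558 IP 10.0.0.3.53890 > 10.0.0.1.5001: Flags [.], " +
      "seq 163564797:163629957, ack 1082992383, win 58, " +
      "options [nop,nop,TS val 9762853 ecr 9762853], length 65160"

    val array = line.split(">")
    val first = Try(array(0).trim.split(" ")(0)).getOrElse("")                   // "07:30:42.415558"
    val second = Try(array(1).trim.split(" ")(6)).getOrElse("")                  // "1082992383," (7th token after '>')
    val third = Try(array(2).trim.split(" ")(0).replace(":", "")).getOrElse("")  // "" because the line contains only one '>'
    println(s"column0=$first | column1=$second | column2=$third")
  }
}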
UPDATE 1: Following answer 1, I got some errors: [screenshot: error output]. I should also mention that each time I add a new text file to the folder to be processed, so there are no old files there.
Can you tell me where the problem is?
Answer 0 (score: 0):
After testing, I can confirm the suspicion I hinted at in the comments. The problem involves multiple Spark contexts on the same JVM (see below).
The correct way to initialize Spark, assuming a Spark version > 2.0 (2.1.0 in this case), is to use the SparkSession builder, like this:
val session = SparkSession.builder
  .master("local[*]")
  .appName("maasg_test")
  .getOrCreate()
We can then create the StreamingContext using the SparkContext underlying that session, like this:
val ssc = new StreamingContext(session.sparkContext, Seconds(5))
With that, the program runs correctly.
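For completeness, the sketch below rewires the question's job to that single context (assumptions: Spark 2.1.x, the same schema and input path as in the question; the object name SingleContextExample is made up for illustration):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.util.Try

object SingleContextExample {
  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder
      .master("local[*]")
      .appName("maasg_test")
      .getOrCreate()

    // Reuse the one SparkContext owned by the session for streaming.
    val ssc = new StreamingContext(session.sparkContext, Seconds(5))

    val customSchema = StructType(Array(
      StructField("column0", StringType, true),
      StructField("column1", StringType, true),
      StructField("column2", StringType, true)))

    val lines = ssc.textFileStream("/Users/saeedtkh/Desktop/sharedsaeed/Test")
    val rows = lines.map(_.split(">")).map { array =>
      val first = Try(array(0).trim.split(" ")(0)).getOrElse("")
      val second = Try(array(1).trim.split(" ")(6)).getOrElse("")
      val third = Try(array(2).trim.split(" ")(0).replace(":", "")).getOrElse("")
      Row.fromSeq(Seq(first, second, third))
    }

    rows.foreachRDD { rdd =>
      // createDataFrame uses the same SparkContext that produced the RDD,
      // so the grouped counts are no longer empty.
      val df = session.createDataFrame(rdd, customSchema)
      df.groupBy("column1", "column2").count().show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}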
With the configuration provided in the question, two SparkContexts are initialized behind the scenes. One is created implicitly by this call:
val ssc = new StreamingContext(conf, Seconds(5))
and a second one when we create it explicitly:
val sc = new SparkContext(conf)
As a consequence, the SQLContext created from that second SparkContext has no visibility over the data generated by the StreamingContext, which is attached to the other, unrelated SparkContext.
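A quick, hypothetical way to see the two-context situation (it assumes spark.driver.allowMultipleContexts has already been set to true on the conf, as in the question, otherwise the second constructor throws):

val ssc = new StreamingContext(conf, Seconds(5)) // implicitly creates SparkContext #1
val sc = new SparkContext(conf)                  // explicitly creates SparkContext #2
println(ssc.sparkContext eq sc)                  // prints false: two distinct contexts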
In the logs, this leaves a trace like the following:
[org.apache.spark.executor.Executor] Exception in task 4.0 in stage 46.0 (TID 209). Stacktrace:
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:313)
at scala.None$.get(Option.scala:311)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:670)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:289)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
One last remark: the config setting spark.driver.allowMultipleContexts should only be set to true in specific cases, typically when running separate jobs in parallel.
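For reference, if two contexts in one JVM are genuinely needed (for example, separate jobs running in parallel), the flag has to be on the SparkConf before the second SparkContext is created, roughly like this (a hypothetical snippet, not something this job requires):

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("multi_context_example")
  .set("spark.driver.allowMultipleContexts", "true")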