Creating/destroying DStreams at runtime

Time: 2016-06-13 22:01:20

Tags: apache-spark spark-streaming

Is it possible to create a DStream with a new name and destroy an old DStream at runtime?

# Read the DStream
inputDstream = ssc.textFileStream("./myPath/")

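Example: I am reading a file named cvd_filter.txt in which each line contains a string that should serve as a filter condition for the DStream. This file gets updated with new values (values may also be appended):

Example: at 10:00; cat cvd_filter.txt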

"1001" "1002" "1003"

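# Read cvd_filter.txt every 5 minutes and create/destroy DStreams accordingly.

with open("cvd_filter.txt") as f:
    # The file holds quoted filter strings separated by whitespace.
    content = [s.strip('"') for s in f.read().split()]

# One filtered DStream per filter string; s=s binds the current value to each lambda.
dstream_content = {s: inputDstream.filter(lambda a, s=s: s in a) for s in content}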

# At this point (dstream_1001, dstream_1002, dstream_1003) should have been created.
# Now, do some operation on the individual DStreams.

At 10:05; cat cvd_filter.txt

"1004" "1002" "1003"

# Create dstream_1004 for the new filter string and destroy only dstream_1001,
# but retain dstream_1002 and dstream_1003.
# At this point (dstream_1004, dstream_1002, dstream_1003) should be present.
# Now, do some operation on the individual DStreams.
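
To make the intended change between the two snapshots concrete, here is a minimal sketch in plain Python with no Spark involved. The helper names parse_filters and diff_filters are hypothetical, and it assumes the filter strings are whitespace-separated and optionally quoted, as in the cat output above.

```python
def parse_filters(text):
    """Split whitespace-separated, optionally quoted filter strings into a set."""
    return {token.strip('"') for token in text.split()}


def diff_filters(old, new):
    """Return (created, destroyed, retained) filter sets between two snapshots."""
    return new - old, old - new, old & new


# The two snapshots of cvd_filter.txt shown above.
at_1000 = parse_filters('"1001" "1002" "1003"')
at_1005 = parse_filters('"1004" "1002" "1003"')

created, destroyed, retained = diff_filters(at_1000, at_1005)
print(sorted(created), sorted(destroyed), sorted(retained))
# prints: ['1004'] ['1001'] ['1002', '1003']
```

So between 10:00 and 10:05 only the stream for "1004" would need to be created and only the one for "1001" destroyed.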

1 answer:

Answer 0 (score: 0)

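No. New streams, or new operations on a DStream, cannot be added to a streaming context that is already running. I suggest modeling your use case in terms of foreachRDD, which leaves you free to perform arbitrary operations on the underlying RDDs. For example: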

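import org.apache.spark.streaming.dstream.DStream

val dstream: DStream[String] = ??? // original DStream
dstream.foreachRDD { rdd =>
  // This body runs once per batch, so the filters can change between batches.
  val filters: Seq[String] = ??? // read the filter file
  val filteredRDDs = filters.map(f => rdd.filter(elem => elem.contains(f)))
  ...
}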

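Then express the operations you need on the individual filtered RDDs. DStreams delegate all transformation operations to the underlying RDDs, so you should be able to express your business logic this way.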
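
Since the question uses Python, the same idea can be sketched there as well. This is only an illustration under assumptions: load_filters, split_by_filters, attach, and the handle callback are hypothetical names, the filter file is assumed to hold whitespace-separated, optionally quoted strings, and the file must be readable from the driver. The only Spark calls relied on are DStream.foreachRDD and RDD.filter.

```python
def load_filters(path):
    """Re-read the filter file; tokens are whitespace-separated and may be quoted."""
    with open(path) as f:
        return [token.strip('"') for token in f.read().split()]


def split_by_filters(rdd, filters):
    """Return {filter string: RDD of the elements that contain it}."""
    # needle=needle binds each filter string now, not when the lambda finally runs.
    return {
        needle: rdd.filter(lambda elem, needle=needle: needle in elem)
        for needle in filters
    }


def attach(dstream, filter_path, handle):
    """Register one foreachRDD callback; handle(needle, rdd) is a hypothetical hook."""
    def process(rdd):
        # Called once per batch, so the filter file is re-read for every batch.
        current = load_filters(filter_path)
        for needle, filtered in split_by_filters(rdd, current).items():
            handle(needle, filtered)

    dstream.foreachRDD(process)
```

A call such as attach(inputDstream, "cvd_filter.txt", handle) would go before ssc.start(). This sketch re-reads the file on every batch rather than every 5 minutes; caching the list between reads is left out for brevity.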