Cartesian product of a DStream

Asked: 2015-03-13 14:28:54

Tags: apache-spark dstream

I use Spark's cartesian function to generate a list of N pairs of values.

I then map over these pairs to compute a distance metric between each pair of users:

val cartesianUsers: org.apache.spark.rdd.RDD[(distance.classes.User, distance.classes.User)] = users.cartesian(users)
cartesianUsers.map(m => manDistance(m._1, m._2))

This works as expected.
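For anyone who wants to reproduce the batch case, here is a minimal sketch run in local mode. The `User` case class, its `features` field, and the body of `manDistance` (taken here to be the Manhattan distance) are assumptions for illustration only; the real `distance.classes.User` and `manDistance` are not shown in this question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for distance.classes.User, which the question does not show.
case class User(id: String, features: Array[Double])

object CartesianBatchSketch {
  // Assumed implementation: Manhattan (L1) distance between two feature vectors.
  // zip stops at the shorter vector, so both are expected to have the same length.
  def manDistance(a: User, b: User): Double =
    a.features.zip(b.features).map { case (x, y) => math.abs(x - y) }.sum

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cartesian-batch-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val users = sc.parallelize(Seq(
        User("a", Array(1.0, 2.0)),
        User("b", Array(4.0, 6.0))
      ))
      // cartesian pairs every element with every element, including itself,
      // so an RDD of N users yields N * N pairs.
      val distances = users.cartesian(users)
        .map { case (u1, u2) => (u1.id, u2.id, manDistance(u1, u2)) }
      distances.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Because the pairing includes `(u, u)` and both orderings of every pair, filtering on the ids before computing the distance may be worthwhile if only distinct pairs are needed.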

Using the Spark Streaming library, I create a DStream and then process each of its RDDs:

val customReceiverStream: ReceiverInputDStream[String] = ssc.receiverStream....
customReceiverStream.foreachRDD(rdd => {
  // count() is an RDD action; it returns the number of elements in this batch
  println("size is " + rdd.count())
})

I could use the cartesian function inside customReceiverStream.foreachRDD, but according to the documentation at http://spark.apache.org/docs/1.2.0/streaming-programming-guide.htm, that is not its intended use:

foreachRDD(func): The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to a external system, like saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.

How can I compute the cartesian product of a DStream? Or am I misunderstanding how DStreams are meant to be used?

1 Answer:

Answer 0: (score: 1)

I was not aware of the transform method, which applies an RDD-to-RDD function to every RDD in the DStream:

cartesianUsers.transform(car => car.cartesian(car))

This is a good talk that also mentions the transform function, at around 17:00: https://www.youtube.com/watch?v=g171ndOHgJ0
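To show how transform and cartesian fit together end to end, here is a minimal sketch under a few assumptions. `queueStream` stands in for the custom receiver, which is not shown in the question; the elements are plain `Array[Double]` feature vectors rather than `User` objects; and `manDistance` is an assumed Manhattan distance. `awaitTerminationOrTimeout` needs Spark 1.3 or later.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CartesianStreamSketch {
  // Assumed implementation: Manhattan (L1) distance between two vectors.
  def manDistance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => math.abs(x - y) }.sum

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cartesian-stream-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // queueStream is used here only as a self-contained test source;
    // in the question the stream comes from a custom receiver instead.
    val queue = mutable.Queue[RDD[Array[Double]]]()
    queue += ssc.sparkContext.parallelize(Seq(Array(1.0, 2.0), Array(4.0, 6.0)))
    val points = ssc.queueStream(queue)

    // transform runs the given RDD-to-RDD function once per batch,
    // so cartesian pairs up only the elements that arrived in the same batch.
    val distances = points
      .transform(rdd => rdd.cartesian(rdd))
      .map { case (a, b) => manDistance(a, b) }

    // print() is an output operation; without one, nothing is computed.
    distances.print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // run for about five seconds, then shut down
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

Since the pairing happens per batch, elements from different batches are never compared. If that is needed, a windowed stream (for example via `window`) could be transformed the same way, at the cost of a cartesian product that grows quadratically with the window size.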