Question

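I am trying to learn how to work with streaming data, using the telecom churn dataset provided here. I have already written a method that does the computation in batch: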

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD, LogisticRegressionWithLBFGS, LogisticRegressionModel, NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
object batchChurn {
  def main(args: Array[String]): Unit = {
    // Set up the Spark context
    val conf = new SparkConf().setAppName("churn")
    val sc = new SparkContext(conf)
    // Load the CSV and map each line to a LabeledPoint
    val csv = sc.textFile("file://filename.csv")
    val data = csv.map { line =>
      val parts = line.split(",").map(_.trim)
      // Features: column 1 plus columns 4 to 19; label: column 20
      val stringvec = Array(parts(1)) ++ parts.slice(4, 20)
      val label = parts(20).toDouble
      val vec = stringvec.map(_.toDouble)
      LabeledPoint(label, Vectors.dense(vec))
    }
    // 70/30 split into training and testing sets
    val splits = data.randomSplit(Array(0.7, 0.3))
    val (training, testing) = (splits(0), splits(1))
    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val numTrees = 6
    val featureSubsetStrategy = "auto"
    val impurity = "gini"
    val maxDepth = 7
    val maxBins = 32
    val model = RandomForest.trainClassifier(training, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
    // Pair each test label with the model's prediction
    val labelAndPreds = testing.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
  }
}

This works without any problems. Next, I looked at the NetworkWordCount example on the Spark website and changed the code slightly to see how it behaves.

val ssc = new StreamingContext(sc, Seconds(5))

val lines = ssc.socketTextStream("127.0.0.1", 9999)

val data = lines.flatMap(_.split(","))

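My question is: is it possible to convert this DStream into an array that I can feed into my analysis code? Currently, when I try to convert val data = lines.flatMap(_.split(",")) to an array, it clearly says: error: value toArray is not a member of org.apache.spark.streaming.dstream.DStream[String]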

Answer 1

Your DStream contains many RDDs; you can access each of them with the foreachRDD function.

https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/streaming/dstream/DStream.html#foreachRDD(scala.Function1)

You can then convert each RDD to an array with the collect function.

This has already been shown here:

For each RDD in a DStream how do I convert this to an array or some other typical Java data type?
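
As a minimal sketch of how foreachRDD could connect the stream to the batch code in the question: instead of building one array for the whole stream, each micro-batch RDD is parsed and scored on its own. The names StreamScoring, parseFeatures and scoreStream are hypothetical, and the sketch assumes that each incoming line has the same column layout as the CSV file and that a RandomForestModel has already been trained in batch.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.streaming.dstream.DStream

object StreamScoring {
  // Parse one CSV line into a feature vector, using the same columns
  // as the batch job (column 1 plus columns 4 to 19).
  def parseFeatures(line: String): Vector = {
    val parts = line.split(",").map(_.trim)
    val raw = Array(parts(1)) ++ parts.slice(4, 20)
    Vectors.dense(raw.map(_.toDouble))
  }

  // Score every micro-batch with an already trained model (assumption:
  // the model comes from RandomForest.trainClassifier in the batch job).
  def scoreStream(lines: DStream[String], model: RandomForestModel): Unit = {
    lines.foreachRDD { rdd =>
      // collect() brings only this batch's predictions to the driver
      val predictions: Array[Double] = model.predict(rdd.map(parseFeatures)).collect()
      println(s"scored ${predictions.length} records in this batch")
    }
  }
}
```

Nothing runs until the StreamingContext is started with ssc.start(), and the lines passed to scoreStream must be whole records, so the flatMap(_.split(",")) step from the question would be dropped.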

Answer 2

DStream.foreachRDD gives you an RDD[String] for each batch interval. Of course, you can collect these into an array:

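  import scala.collection.mutable.ArrayBuffer

  // Driver-side buffer that accumulates the contents of every batch
  val arr = new ArrayBuffer[String]()
  data.foreachRDD { rdd =>
    arr ++= rdd.collect()
  }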

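Also keep in mind that, because a DStream can be very large, you may end up with far more data in the driver than you want.

To limit the data being analyzed, I would do this: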

data.slice(new Time(fromMillis), new Time(toMillis)).flatMap(_.collect()).toSet
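
Another way to keep the driver-side data bounded, shown here only as a sketch: take at most a fixed number of elements from each batch and keep only the most recent ones. BoundedCollect, appendBounded, collectRecent, perBatch and maxSize are hypothetical names introduced for illustration, not part of Spark.

```scala
import org.apache.spark.streaming.dstream.DStream
import scala.collection.mutable.ArrayBuffer

object BoundedCollect {
  // Append `batch` to `buffer`, then drop the oldest elements so that
  // at most `maxSize` elements remain.
  def appendBounded(buffer: ArrayBuffer[String], batch: Array[String], maxSize: Int): Unit = {
    buffer ++= batch
    if (buffer.size > maxSize) buffer.remove(0, buffer.size - maxSize)
  }

  // Register an output operation that copies at most `perBatch` elements
  // of each batch to the driver. The returned buffer fills up only after
  // the StreamingContext has been started.
  def collectRecent(data: DStream[String], perBatch: Int, maxSize: Int): ArrayBuffer[String] = {
    val recent = new ArrayBuffer[String]()
    data.foreachRDD { rdd =>
      appendBounded(recent, rdd.take(perBatch), maxSize)
    }
    recent
  }
}
```

The function passed to foreachRDD runs on the driver, which is why it can update a local buffer; code that reads the buffer from another thread would need its own synchronization.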

Answer 3

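You cannot put all the elements of a DStream into an array, because those elements keep being read off the wire and your array would have to grow indefinitely.

Adapting this kind of decision tree model to a streaming setting, where training and test data arrive continuously, is not trivial for algorithmic reasons. While the answers that mention collect are technically correct, they are not an appropriate solution for what you are trying to do.

If you want to run decision trees over a stream in Spark, you may want to look at Hoeffding trees.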
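
For context, Hoeffding trees decide when to split a leaf using the Hoeffding bound: after n observations of a quantity with range R, the observed mean is within epsilon = sqrt(R^2 * ln(1/delta) / (2n)) of the true mean with probability 1 - delta. The sketch below shows only that split test in plain Scala; it is not a Spark API, and HoeffdingBound, epsilon and shouldSplit are hypothetical names.

```scala
object HoeffdingBound {
  // epsilon = sqrt(R^2 * ln(1/delta) / (2n))
  def epsilon(range: Double, delta: Double, n: Long): Double =
    math.sqrt(range * range * math.log(1.0 / delta) / (2.0 * n))

  // Split once the gap between the best and second-best split scores
  // exceeds epsilon, i.e. the best attribute is very likely the true best.
  def shouldSplit(bestGain: Double, secondGain: Double,
                  range: Double, delta: Double, n: Long): Boolean =
    (bestGain - secondGain) > epsilon(range, delta, n)
}
```

Because epsilon shrinks as n grows, a leaf that cannot yet justify a split simply waits for more records, which is why the tree can be grown incrementally without storing the stream.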

Can I convert an incoming data stream into an array?
