Question

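DataFrames are a fairly new concept to me. Several sources recommend them over RDDs and describe how they outperform RDDs in many cases. I wanted to see whether DataFrames are a viable option (ultimately, I will be processing byte arrays), so I compared the performance of a word count application on a 1 GB file.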

When I ran the programs, I got the following results:

RDD total: 137733312 elapsed time: 44.5675378 s

DF total: 137733312 elapsed time: 69.201253448 s

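I expected the DataFrame version to run faster than the RDD version. Is this the result of a poor implementation? Or is it because the DataFrame implementation calls textFile, so the data is first loaded into an RDD and then converted to a DataFrame? Does that affect performance? Is it advisable to convert my file to a Parquet file (since that is the default data source) and load directly from it?

I would like to know if anyone can explain why the RDD version outperforms the DataFrame version.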

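import org.apache.spark.SparkContext
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{sum, udf}
import scala.collection.mutable.WrappedArray

def testDF(sc: SparkContext, sqlContext: SQLContext,
    fname: String, threshold: Int): Long = {
    import sqlContext.implicits._
    // load the file as an RDD of lines, then convert it to a one-column DataFrame
    val linesDF = sc.textFile(fname).toDF("line")
    // split each line into an array of words
    val tokenizer = new Tokenizer().setInputCol("line").setOutputCol("words")
    val wordsDF = tokenizer.transform(linesDF)
    // count the words on each line, then sum the per-line counts
    val countUDF = udf((data: WrappedArray[String]) => data.size)
    val countTotal = wordsDF.withColumn("count", countUDF('words)).agg(sum("count"))

    countTotal.first()(0).asInstanceOf[Long]
}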

def testRDD(sc: SparkContext, fname: String): Int = {
    // split each line into words
    val tokenized = sc.textFile(fname).flatMap(_.split(" "))

    // count the occurrences of each word
    val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)

    // sum the per-word counts to get the total number of words
    val countTotal: Int = wordCounts.map(_._2).reduce((a, b) => a + b)

    countTotal
}
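
A side note on testRDD: it first counts each distinct word and then sums those counts, which should give the same number as counting the tokens directly. Below is a minimal sketch of that equivalence using plain Scala collections instead of Spark. The object name WordTotals and the sample lines are made up for illustration and are not part of the benchmark.

```scala
object WordTotals {
  // Hypothetical stand-in for the lines of the input file.
  val sampleLines: Seq[String] = Seq("a b a", "b c")

  // Mirrors the shape of testRDD: tokenize, count each distinct word,
  // then add up the per-word counts.
  def totalViaWordCounts(lines: Seq[String]): Int = {
    val tokens = lines.flatMap(_.split(" "))
    val wordCounts = tokens.groupBy(identity).map { case (w, ws) => (w, ws.size) }
    wordCounts.values.sum
  }

  // Counts the tokens on each line and adds them up, with no per-word grouping.
  def totalDirect(lines: Seq[String]): Int =
    lines.map(_.split(" ").length).sum

  def main(args: Array[String]): Unit = {
    println(totalViaWordCounts(sampleLines)) // 5
    println(totalDirect(sampleLines))        // 5
  }
}
```

If the same holds on the real data, the reduceByKey step in testRDD does per-word grouping that testDF never performs, so the two tests may not be timing the same pipeline.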

RDD vs. DataFrame performance

0 answers: