How do I convert an org.apache.spark.sql.DataFrame to an org.apache.spark.rdd.RDD[Double]?

Asked: 2018-10-03 17:54:11

Tags: apache-spark apache-spark-sql

I'm trying to apply the idea from https://fullstackml.com/how-to-check-hypotheses-with-bootstrap-and-apache-spark-cd750775286a to a DataFrame I have. This is the part of the code I'm using:

import scala.util.Sorting.quickSort

def getConfInterval(input: org.apache.spark.rdd.RDD[Double], N: Int, left: Double, right:Double)
            : (Double, Double) = {
    // Simulate by sampling and calculating averages for each of subsamples
    val hist = Array.fill(N){0.0}
    for (i <- 0 to N-1) {
        hist(i) = input.sample(withReplacement = true, fraction = 1.0).mean
    }

    // Sort the averages and calculate quantiles
    quickSort(hist)
    val left_quantile  = hist((N*left).toInt)
    val right_quantile = hist((N*right).toInt)
    return (left_quantile, right_quantile)
}

It runs fine, but when I try to apply it to:

val data = mydf.map( _.toDouble )

val (left_qt, right_qt) = getConfInterval(data, 1000, 0.025, 0.975)

val H0_mean = 30
if (left_qt < H0_mean && H0_mean < right_qt) {
    println("We failed to reject H0. It seems like H0 is correct.")
} else {
    println("We rejected H0")
}

I get the error:

error: value toDouble is not a member of org.apache.spark.sql.Row
       val data = dfTeste.map(_.toDouble)

Without the .map( _.toDouble ) part, I get:

Notebook:4: error: type mismatch;
 found   : org.apache.spark.sql.DataFrame
    (which expands to)  org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 required: org.apache.spark.rdd.RDD[Double]

mydf is basically a DataFrame from which I selected a single column (its type is double; the rows are 0.0 or 1.0).
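(For reference, a toy DataFrame of that shape can be built as below; this is just a sketch, where spark is the SparkSession and the column name label is a placeholder:)

import spark.implicits._

// Hypothetical reproducer: a single double column holding 0.0 / 1.0 values
val mydf = Seq(0.0, 1.0, 1.0, 0.0, 1.0).toDF("label")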

When I do:

dfTeste.map(x=>x.toString()).rdd

it converts successfully to org.apache.spark.rdd.RDD[String], but I can't find a way to do the same for Double. I'm very new to this, so apologies if the question doesn't make much sense.

1 Answer:

Answer 0 (score: 0):

Apparently val data = mydf.map( _.toDouble ) does not give you an RDD[Double]: mapping over a DataFrame stays in the Dataset API rather than returning an RDD (and Row has no toDouble method in the first place, which is what the first error tells you).

In the example you linked, they used

val data = dataWithHeader.filter( _ != header ).map( _.toDouble )

which is an RDD[Double] (sc.textFile returns an RDD[String], and String, unlike Row, does have a toDouble method).

So you need to convert mydf to an RDD first, which you can do, for example, with:

val data = mydf.map(r => r.getDouble(0)).rdd
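As an alternative sketch (assuming mydf has exactly one column of type double, spark is your SparkSession, and the column contains no nulls), you can also go through a typed Dataset:

// The Double encoder comes from the SparkSession implicits.
import spark.implicits._

// DataFrame -> Dataset[Double] -> RDD[Double]
val data: org.apache.spark.rdd.RDD[Double] =
  mydf.select(mydf.columns(0)).as[Double].rdd

// The bootstrap function from the question now type-checks:
val (left_qt, right_qt) = getConfInterval(data, 1000, 0.025, 0.975)

Either way, the key step is leaving the Dataset API via .rdd, because getConfInterval is written against org.apache.spark.rdd.RDD[Double].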