I'm trying to apply the idea from https://fullstackml.com/how-to-check-hypotheses-with-bootstrap-and-apache-spark-cd750775286a to a dataframe I have. The code I'm using is this part:
import scala.util.Sorting.quickSort

def getConfInterval(input: org.apache.spark.rdd.RDD[Double], N: Int, left: Double, right: Double)
    : (Double, Double) = {
  // Simulate by sampling and calculating averages for each of subsamples
  val hist = Array.fill(N){0.0}
  for (i <- 0 to N - 1) {
    hist(i) = input.sample(withReplacement = true, fraction = 1.0).mean
  }
  // Sort the averages and calculate quantiles
  quickSort(hist)
  val left_quantile = hist((N * left).toInt)
  val right_quantile = hist((N * right).toInt)
  return (left_quantile, right_quantile)
}
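The quantile extraction at the end of the function can be checked on a small sorted array without Spark; a minimal plain-Scala sketch (the array contents and the `N = 100` here are illustrative, not from the question):

```scala
import scala.util.Sorting.quickSort

// 100 simulated "bootstrap means", deliberately in reverse order
val hist = Array.tabulate(100)(i => (99 - i) / 100.0)
quickSort(hist) // ascending: 0.00, 0.01, ..., 0.99

// With N = 100, the 2.5% and 97.5% quantiles land at indices 2 and 97
val leftQ  = hist((100 * 0.025).toInt) // hist(2)  == 0.02
val rightQ = hist((100 * 0.975).toInt) // hist(97) == 0.97
println(s"[$leftQ, $rightQ]")
```

Note that `(N * left).toInt` truncates, so 2.5 becomes index 2 and 97.5 becomes index 97.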
It runs fine, but when I try to apply it to:
val data = mydf.map( _.toDouble )
val (left_qt, right_qt) = getConfInterval(data, 1000, 0.025, 0.975)

val H0_mean = 30
if (left_qt < H0_mean && H0_mean < right_qt) {
  println("We failed to reject H0. It seems like H0 is correct.")
} else {
  println("We rejected H0")
}
I get the error:

error: value toDouble is not a member of org.apache.spark.sql.Row
       val data = dfTeste.map(_.toDouble)
Without the .map( _.toDouble ), I get:

notebook:4: error: type mismatch;
 found   : org.apache.spark.sql.DataFrame
    (which expands to)  org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 required: org.apache.spark.rdd.RDD[Double]
mydf is basically a dataframe from which I selected a single column (its type is double, and its rows hold 0.0 or 1.0).
When I do:

dfTeste.map(x => x.toString()).rdd

it converts successfully to org.apache.spark.rdd.RDD[String], but I can't find a way to do the same for Double. I'm very new to this, so I apologize if the question doesn't make much sense.
Answer 0 (score: 0):
Apparently val data = mydf.map( _.toDouble ) is not an RDD[Double] but a DataFrame.

In the example you linked, they used

val data = dataWithHeader.filter( _ != header ).map( _.toDouble )

which is an RDD[Double] (sc.textFile returns an RDD).
So you need to convert mydf to an RDD, which you can do, for example, with:

val data = mydf.map(r => r.getDouble(0)).rdd

(or equivalently mydf.rdd.map(_.getDouble(0)), which converts to an RDD first and so does not need an implicit Encoder in scope).
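To sanity-check the bootstrap logic itself without a Spark cluster, the same procedure can be run on a plain Scala Array standing in for the RDD. This is only a local sketch: the fixed seed, sample sizes, and the 0.0/1.0 data (mimicking the column described in the question) are all illustrative:

```scala
import scala.util.Random
import scala.util.Sorting.quickSort

def getConfIntervalLocal(input: Array[Double], n: Int, left: Double, right: Double): (Double, Double) = {
  val rng = new Random(42) // fixed seed so the run is reproducible
  val hist = Array.fill(n) {
    // resample with replacement (the local analogue of sample(true, 1.0)) and average
    Array.fill(input.length)(input(rng.nextInt(input.length))).sum / input.length
  }
  quickSort(hist)
  (hist((n * left).toInt), hist((n * right).toInt))
}

// A column of 0.0s and 1.0s, true mean 0.5
val data = Array.fill(100)(0.0) ++ Array.fill(100)(1.0)
val (leftQt, rightQt) = getConfIntervalLocal(data, 1000, 0.025, 0.975)

val h0Mean = 30.0
if (leftQt < h0Mean && h0Mean < rightQt) println("We failed to reject H0.")
else println("We rejected H0") // 30 lies far outside a CI centered near 0.5
```

With data whose true mean is 0.5, the interval comes out roughly [0.43, 0.57], so H0_mean = 30 is rejected, matching what the Spark version should do once `data` really is an RDD[Double].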