I am familiar with Spark and Scala, and my current task is to "sum" these two DataFrames:
+---+--------+-------------------+
|cyl|avg(mpg)|      var_samp(mpg)|
+---+--------+-------------------+
|  8|    15.8| 1.0200000000000014|
|  6|    20.9|0.48999999999999966|
|  4|    33.9|                0.0|
+---+--------+-------------------+

+---+------------------+------------------+
|cyl|          avg(mpg)|     var_samp(mpg)|
+---+------------------+------------------+
|  8|             13.75| 6.746999999999998|
|  6|              21.4|               NaN|
+---+------------------+------------------+
In this case the "key" is cyl, and the "values" are avg(mpg) and var_samp(mpg).
The (approximate) sum of these two would be:
+---+--------+-------------------+
|cyl|avg(mpg)|      var_samp(mpg)|
+---+--------+-------------------+
|  8|   29.55|            7.76712|
|  6|    42.3|0.48999999999999966|
|  4|    33.9|                0.0|
+---+--------+-------------------+
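Note that NaN is treated as zero, and that a "key" may be missing from one of the DataFrames (key 4 is missing from the second one).
I suspect reduceByKey is the way to go here, but I can't get it to work.
Here is my code so far: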
import org.apache.spark.{SparkConf, SparkContext}

case class Cars(car: String, mpg: String, cyl: String, disp: String, hp: String,
                drat: String, wt: String, qsec: String, vs: String, am: String, gear: String, carb: String)

// A plain object with an explicit main; extending App and overriding main is redundant
object Bootstrapping {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Spark and SparkSql").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    // Exploring SparkSQL
    // Initialize an SQLContext
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._
    import sqlContext.implicits._
    // Load a csv file
    val csv = sc.textFile("mtcars.csv")
    // Create a Spark DataFrame, dropping the header row
    val headerAndRows = csv.map(line => line.split(",").map(_.trim))
    val header = headerAndRows.first
    val mtcdata = headerAndRows.filter(_(0) != header(0))
    val mtcars = mtcdata
      .map(p => Cars(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7), p(8), p(9), p(10), p(11)))
      .toDF
    // Aggregate data after grouping by columns
    import org.apache.spark.sql.functions._
    mtcars.sort($"cyl").show()
    mtcars.groupBy("cyl").agg(avg("mpg"), var_samp("mpg")).sort($"cyl").show()
    // Sample 25% of the population without replacement
    val sampledData = mtcars.sample(false, 0.25)
    // Bootstrapping loop
    for (a <- 1 to 5) {
      // Get a bootstrap sample (100% of sampledData, with replacement)
      val bootstrapSample = sampledData.sample(true, 1)
      // HERE!!! I WANT TO SAVE THE AGGREGATED SUM OF THE FOLLOWING:
      bootstrapSample.groupBy("cyl").agg(avg("mpg"), var_samp("mpg"))
    }
  }
}
This is the data I'm using: Motor Trend Car Road Tests
Answer 0 (score: 0)
One approach is to union the two DataFrames, use when/otherwise to convert NaN to zero, and then perform a groupBy to aggregate the sum of each column, as follows:
import org.apache.spark.sql.functions._
// `spark` is the active SparkSession (predefined in spark-shell)
import spark.implicits._

val df1 = Seq(
  (8, 15.8, 1.0200000000000014),
  (6, 20.9, 0.48999999999999966),
  (4, 33.9, 0.0)
).toDF("cyl", "avg_mpg", "var_samp_mpg")

val df2 = Seq(
  (8, 13.75, 6.746999999999998),
  (6, 21.4, Double.NaN)
).toDF("cyl", "avg_mpg", "var_samp_mpg")

// Stack the rows, replace NaN with 0.0, then sum both value columns per key
(df1 union df2).
  withColumn("var_samp_mpg", when($"var_samp_mpg".isNaN, 0.0).otherwise($"var_samp_mpg")).
  groupBy("cyl").agg(sum("avg_mpg"), sum("var_samp_mpg")).
  show
// +---+------------+-------------------+
// |cyl|sum(avg_mpg)|  sum(var_samp_mpg)|
// +---+------------+-------------------+
// |  6|        42.3|0.48999999999999966|
// |  4|        33.9|                0.0|
// |  8|       29.55| 7.7669999999999995|
// +---+------------+-------------------+
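
To make the semantics of the union / NaN-to-zero / groupBy / sum pipeline concrete, here is a minimal sketch of the same merge on plain Scala collections, with no Spark involved. The names Stats, SumByKey, sumByKey, first and second are hypothetical and only serve this illustration; they are not part of any Spark API. The sketch assumes, like the code above, that only the variance column can contain NaN.

```scala
// Plain-Scala sketch of "union, replace NaN, group by key, sum" (no Spark needed).
// Stats, SumByKey, sumByKey, first and second are hypothetical names for illustration.
final case class Stats(avgMpg: Double, varSampMpg: Double)

object SumByKey {
  // Mirrors when(col.isNaN, 0.0).otherwise(col) for a single value.
  private def nanToZero(x: Double): Double = if (x.isNaN) 0.0 else x

  // Concatenate all rows (the "union"), group them by cyl, then sum each column.
  // Keys that appear in only some of the tables are kept with their own values.
  def sumByKey(tables: Seq[Seq[(Int, Stats)]]): Map[Int, Stats] =
    tables.flatten
      .groupBy { case (cyl, _) => cyl }
      .map { case (cyl, rows) =>
        val avgSum = rows.map { case (_, s) => s.avgMpg }.sum
        val varSum = rows.map { case (_, s) => nanToZero(s.varSampMpg) }.sum
        cyl -> Stats(avgSum, varSum)
      }

  // The two tables from the question, keyed by cyl.
  val first: Seq[(Int, Stats)] = Seq(
    8 -> Stats(15.8, 1.0200000000000014),
    6 -> Stats(20.9, 0.48999999999999966),
    4 -> Stats(33.9, 0.0)
  )
  val second: Seq[(Int, Stats)] = Seq(
    8 -> Stats(13.75, 6.746999999999998),
    6 -> Stats(21.4, Double.NaN)
  )
}
```

Calling SumByKey.sumByKey(Seq(SumByKey.first, SumByKey.second)) yields one entry per cyl value, including key 4, which only the first table contains. Since sumByKey accepts any number of tables, the same idea should carry over to the bootstrap loop in the question: collect the per-iteration aggregates and combine them in one pass instead of pairwise.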