Stratified sampling: building the fractions map for the sampleBy method in Scala

Time: 2018-06-07 15:58:12

Tags: scala apache-spark dataframe apache-spark-sql

I have a single-column DataFrame, newDf, that looks like this:

+------------+
|       value|
+------------+
|5TEJU62N58Z4|
|000000000000|
|1J4GW48SX4C3|
|1J4GW68S2XC7|
|1J4GK48K04W1|

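It has 486 rows. I want to do stratified sampling on this DataFrame. To do that, I first need to create a fractions map and then pass it as an argument to the sampleBy method. Here is what I tried: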

val fractions = newDf.distinct.map(x => (x,0.8)).collect().toMap
val sampled_df = newDf.stat.sampleBy("value", fractions, 10L)

But it fails with this error:

Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row
- field (class: "org.apache.spark.sql.Row", name: "_1")
- root class: "scala.Tuple2"

I also tried building the fractions like this:

val fractions = newDf.map(_._1).distinct.map(x => (x,0.8)).collectAsMap()

But that does not compile and gives this error:

Error:(32, 33) value _1 is not a member of org.apache.spark.sql.Row
    val fractions = newDf.map(_._1).distinct.map(x => (x,0.8)).collectAsMap()

How can I build this fractions map so that I can pass it to the sampleBy method and do the sampling?

1 Answer:

Answer 0: (score: 1)

How about simply:

newDf.distinct.as[String].collect.map((_, 0.8)).toMap
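
For context, the first attempt fails because mapping over a DataFrame hands each element to the lambda as a Row, and Spark has no Encoder for a (Row, Double) tuple. The second fails because Row has no _1 member. Converting to Dataset[String] before collecting avoids both problems, and the map is then built on the driver from a plain array. Below is a minimal end-to-end sketch of that approach. The local SparkSession settings, the object name, and the four sample values standing in for newDf are assumptions for illustration only; the sketch also assumes the value column holds strings, and as[String] needs spark.implicits._ in scope.

```scala
import org.apache.spark.sql.SparkSession

object StratifiedSampleSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sampleBy-sketch")
      .getOrCreate()

    // Brings the String encoder (for as[String]) and toDF into scope.
    import spark.implicits._

    // Hypothetical stand-in for newDf: one string column named "value".
    val newDf = Seq("5TEJU62N58Z4", "000000000000", "1J4GW48SX4C3", "000000000000")
      .toDF("value")

    // Collect the distinct keys as plain Strings, then pair each with 0.8 on the driver.
    val fractions: Map[String, Double] =
      newDf.distinct.as[String].collect.map((_, 0.8)).toMap

    // Sample each stratum (distinct value) with its fraction, using a fixed seed.
    val sampledDf = newDf.stat.sampleBy("value", fractions, 10L)
    sampledDf.show()

    spark.stop()
  }
}
```

Note that sampleBy keeps each row independently with the probability given for its stratum, so the sampled row counts are approximate rather than exactly 80% per value.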