I have a single-column DataFrame, newDf, that looks like this:
+------------+
| value|
+------------+
|5TEJU62N58Z4|
|000000000000|
|1J4GW48SX4C3|
|1J4GW68S2XC7|
|1J4GK48K04W1|
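It has 486 rows. I want to do stratified sampling on this DataFrame. To do that, I first need to build a fractions map and then pass it as an argument to the sampleBy method. This is what I tried: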
val fractions = newDf.distinct.map(x => (x,0.8)).collect().toMap
val sampled_df = newDf.stat.sampleBy("value", fractions, 10L)
But it fails with this error:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row
- field (class: "org.apache.spark.sql.Row", name: "_1")
- root class: "scala.Tuple2"
I also tried preparing the fractions like this:
val fractions = newDf.map(_._1).distinct.map(x => (x,0.8)).collectAsMap()
But that gives me a compile error:
Error:(32, 33) value _1 is not a member of org.apache.spark.sql.Row
val fractions = newDf.map(_._1).distinct.map(x => (x,0.8)).collectAsMap()
How do I prepare this fractions map so that I can pass it to the sampleBy method and do the sampling?
Answer 0 (score: 1)
How about simply:
newDf.distinct.as[String].collect.map((_, 0.8)).toMap
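
Both attempts in the question fail because mapping over a DataFrame yields Row objects: Spark has no Encoder for a (Row, Double) tuple, and Row has no _1 member. Converting the column to a Dataset[String] before collecting avoids both problems. Below is a minimal end-to-end sketch of that idea, not a definitive implementation. The object name StratifiedSampleSketch, the helper uniformFractions, the local SparkSession, and the handful of sample rows are all assumptions made for illustration. It passes Encoders.STRING explicitly, which plays the same role as importing spark.implicits._ before calling as[String].

```scala
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}

object StratifiedSampleSketch {

  // Hypothetical helper: gives every distinct value of `column` the same fraction.
  // Assumes the column holds strings.
  def uniformFractions(df: DataFrame, column: String, fraction: Double): Map[String, Double] =
    df.select(column)
      .distinct()
      .as[String](Encoders.STRING) // Dataset[String] instead of Dataset[Row]
      .collect()                   // distinct keys are brought to the driver
      .map(key => (key, fraction))
      .toMap

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sampleBy-sketch")
      .getOrCreate()
    import spark.implicits._ // needed for toDF below

    // A few rows standing in for the 486-row newDf from the question.
    val newDf = Seq(
      "5TEJU62N58Z4", "000000000000", "1J4GW48SX4C3",
      "1J4GW68S2XC7", "1J4GK48K04W1", "000000000000"
    ).toDF("value")

    val fractions = uniformFractions(newDf, "value", 0.8)

    // Same call as in the question: column name, fractions map, seed.
    val sampledDf = newDf.stat.sampleBy("value", fractions, 10L)
    sampledDf.show()

    spark.stop()
  }
}
```

Note that collect() pulls every distinct key to the driver, which is fine for a few hundred rows but may not be for a column with very many distinct values. Also, sampleBy keeps each row with the given probability, so the sampled size per stratum is approximately, not exactly, 80% of that stratum.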