Question

I am using Spark (1.5.2) DataFrames and trying to get a stratified data set. My data is prepared for binary classification, and the class column has only two values, 1 and 0.

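The code below prints the per-class counts to the console, and the ratio of class 1 to class 0 it shows is very far from what I expected: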

import scala.util.Random

// Split 70/30, then sample the training set by class with sampleBy.
val Array(trainingData, testData) = df.randomSplit(Array(0.7, 0.3))
val fractions: Map[Int, Double] = Map(1 -> 0.5, 0 -> 0.5)

val trainingData3 = trainingData.stat.sampleBy("class", fractions, new Random().nextLong)

// Count the rows of each class that remain after sampling.
println("Training True Class = " + trainingData3.where("class=1").count())
println("Training False Class = " + trainingData3.where("class=0").count())

Answer 1

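The fractions you pass to sampleBy on a DataFrame, just like those for sampleByKeyExact and sampleByKey on an RDD, are not the percentages you want in the final result set. Instead, each fraction is the share of that class in the original data set that you want to keep.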
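To see why equal fractions do not balance the classes, here is a small arithmetic sketch in plain Scala (no Spark needed). The class counts are assumed numbers chosen for illustration, not taken from the question, and the result is the expected size of each class rather than what any single sampling run would return.

```scala
// Assumed class counts for illustration only (not from the question).
val originalCounts = Map(0 -> 9800L, 1 -> 200L)
// The same keep fraction for both classes, as in the question.
val keepFractions = Map(0 -> 0.5, 1 -> 0.5)

// Expected rows kept per class = original count * keep fraction.
val expectedKept = originalCounts.map { case (label, n) => label -> n * keepFractions(label) }

// Share of class 1 before and after sampling.
val shareBefore = originalCounts(1).toDouble / originalCounts.values.sum
val shareAfter = expectedKept(1) / expectedKept.values.sum

println(s"class 1 share before: $shareBefore, after: $shareAfter")
```

Both shares come out the same: keeping half of every class halves the data set but leaves the class ratio unchanged.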

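To get a 50/50 split, you need to compare the counts of class 1 and class 0 in the full data set, work out their ratio, and then use that ratio to choose the fractions.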

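So, for example, if 98% of the records are class 0 and 2% are class 1, and you want a 50/50 split, you could use a fraction of 100% for class 1 and about 2% for class 0 (2/98 is roughly 2.04%, so 2% is a close approximation).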

val fractions: Map[Int, Double] = Map(1 -> 1.0, 0 -> 0.02)
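The fractions can also be derived from the class counts instead of being typed in by hand. The following is a minimal sketch in plain Scala; `balancingFractions` is a hypothetical helper (not part of Spark), and the counts are assumed numbers matching the 98% / 2% example above.

```scala
// Hypothetical helper, not part of Spark: derives per-class keep fractions
// that downsample every class to the size of the rarest class.
def balancingFractions(counts: Map[Int, Long]): Map[Int, Double] = {
  require(counts.nonEmpty && counts.values.forall(_ > 0),
    "every class needs at least one row")
  val smallest = counts.values.min.toDouble
  counts.map { case (label, n) => label -> smallest / n }
}

// Assumed counts matching the 98% / 2% example (illustrative only).
val classCounts = Map(0 -> 9800L, 1 -> 200L)
val balanced = balancingFractions(classCounts)

// Class 1 is kept in full; class 0 is kept at 200 / 9800, about 0.0204.
println(balanced)
```

With a DataFrame, the counts could presumably be collected from `trainingData.groupBy("class").count()` and the resulting map passed to `trainingData.stat.sampleBy("class", balanced, seed)`. Because sampleBy samples each row independently, the resulting class sizes will be close to equal rather than exactly equal.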

sampleBy returns very skewed results
