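I am trying to draw a sample of an exact size from a dataset, but I don't understand why the seed and the fraction interact the way they do.
From the Spark 2.3.0 documentation: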
sample(withReplacement=None, fraction=None, seed=None)
Original dataset:
+---+----------+----------+-----+
| id|      name|      test|score|
+---+----------+----------+-----+
|  1|Chun      |SQL       |   75|
|  2|Chun      |Tuning    |   73|
|  3|Esben     |SQL       |   43|
|  4|Esben     |Tuning    |   31|
|  5|Kaolin    |SQL       |   56|
|  6|Kaolin    |Tuning    |   88|
|  7|Tatiana   |SQL       |   87|
|  8|Tatiana   |Tuning    |   83|
+---+----------+----------+-----+
df_sample = df.sample(False, 0.5, 111)
df_sample.show()
Output:
+---+----------+----------+-----+
| id|      name|      test|score|
+---+----------+----------+-----+
|  1|Chun      |SQL       |   75|
|  2|Chun      |Tuning    |   73|
|  4|Esben     |Tuning    |   31|
|  5|Kaolin    |SQL       |   56|
|  6|Kaolin    |Tuning    |   88|
+---+----------+----------+-----+
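With a fraction of 0.5 on 8 rows, I expected the sample to always contain 4 rows, but for some reason the sample size also changes when I change the seed (111 in the call above). Can anyone explain exactly what "fraction" and "seed" do, and why this happens? Here is a second example with a different seed: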
df_sample = df.sample(False, 0.5, 45)
df_sample.show()
Output:
+---+----------+----------+-----+
| id|      name|      test|score|
+---+----------+----------+-----+
|  1|Chun      |SQL       |   75|
|  3|Esben     |SQL       |   43|
|  8|Tatiana   |Tuning    |   83|
+---+----------+----------+-----+
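To narrow the question down, here is a small pure-Python sketch of what I suspect is going on. It assumes that sampling without replacement keeps each row independently with probability equal to the fraction (a Bernoulli trial per row), and that the seed only fixes the random sequence. The helpers `bernoulli_sample` and `exact_sample` are hypothetical names for illustration, not Spark APIs, and Python's `random` module will not reproduce Spark's actual row choices.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`.

    Assumption: this mimics per-row Bernoulli sampling; it is not Spark's code.
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def exact_sample(rows, n, seed):
    """Draw exactly `n` distinct rows, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


ids = list(range(1, 9))  # stand-in for the 8 row ids in the dataset

# The fraction stays at 0.5; only the seed changes between runs.
sizes = [len(bernoulli_sample(ids, 0.5, seed)) for seed in range(200)]
print("distinct sample sizes seen:", sorted(set(sizes)))

# The same seed always reproduces the same sample.
print(bernoulli_sample(ids, 0.5, 111) == bernoulli_sample(ids, 0.5, 111))

# A fixed-size draw always returns exactly 4 rows.
print(len(exact_sample(ids, 4, 45)))
```

If that assumption holds, the fraction is only the expected share of rows, and the seed makes a given draw repeatable without pinning its size. For a fixed number of rows in PySpark, one option I am considering is ordering by `pyspark.sql.functions.rand(seed)` and then calling `limit(4)`, though I have not verified how reproducible that is across partitions.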