Question

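I have a dataset with 150,000 data points. Each data point has several fields, including a value column. I want to sample the dataset so that rows with higher values are more likely to be selected than rows with lower values. So, based on the example below, the new dataset should contain far more items with value 1000 than items with value 5.

I'm not sure how to do this in pandas. Can anyone help?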

╔══════════════════════════════════════╗
║ id    description    number    value ║
╠══════════════════════════════════════╣
║ 0     A              1         20    ║
║ 1     A              11        50    ║
║ 2     A              1         10    ║
║ 3     A              14        1000  ║
║ 4     A              1         20    ║
║ 5     A              13        50    ║
║ 6     A              1         800   ║
║ 7     A              1         30    ║
║ 8     A              13        5     ║
║ 9     A              12        500   ║
╚══════════════════════════════════════╝

Thank you very much for all your help!

Answer 1

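If you want to sample from every value, you can use groupby together with sample to draw a set number of rows for each distinct value. The weights parameter gives rows different weights; it takes one weight per row of df, and the weights are normalised within each group. See the pandas documentation for DataFrameGroupBy.sample.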

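# weights needs one entry per row of df (a short list like [1, 2] raises an error); replace=True covers groups with fewer than 100 rows
df_values = df.groupby("value").sample(n=100, weights=df["value"], replace=True)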
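
To see what a grouped sample does with the question's table, here is a minimal sketch. It assumes pandas 1.1 or later (where `DataFrameGroupBy.sample` is available). The `df` built here only reproduces the ten example rows, using `number` as the per-row weight is purely illustrative, and `random_state=0` is there just for reproducibility. Note that a grouped sample draws the same number of rows from every distinct value, so it balances the values rather than favouring the high ones.

```python
import pandas as pd

# Rebuild the ten example rows from the question
df = pd.DataFrame({
    "id": range(10),
    "description": ["A"] * 10,
    "number": [1, 11, 1, 14, 1, 13, 1, 1, 13, 12],
    "value": [20, 50, 10, 1000, 20, 50, 800, 30, 5, 500],
})

# Draw one row per distinct value. The weights Series has one entry per row
# of df and is normalised within each group ("number" is an illustrative choice).
per_value = df.groupby("value").sample(n=1, weights=df["number"], random_state=0)
print(per_value)
```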

Answer 2

You should use the pandas sample method and pass df["value"] as the weights parameter, which in your case would be df.sample(n=10, weights=df["value"]). See the pandas documentation for DataFrame.sample.

# Draw 10 rows; each row's chance of being picked is proportional to its value
sampled = df.sample(n=10, weights=df["value"])
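
The effect of passing the value column as weights can be checked by hand: pandas normalises the weights so that they sum to 1, so in any single draw a row's chance of being picked is its value divided by the column total. A small sketch using the ten example values (the `values` Series is constructed here only for illustration):

```python
import pandas as pd

# The value column from the question's example table, indexed by id 0..9
values = pd.Series([20, 50, 10, 1000, 20, 50, 800, 30, 5, 500], name="value")

# Per-draw selection probability of each row: value / sum of all values
probs = values / values.sum()

print(values.sum())
print(probs.round(4))
```

In a single draw, the row with value 1000 is therefore 200 times as likely to be picked as the row with value 5.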

Answer 3

You can use the sample method with the value column as the weights.

df.sample(n=n, weights="value")  # n is the number of rows to draw
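
As a rough end-to-end sketch on the question's table: the `df` below just rebuilds the ten example rows, and `n=1000`, `replace=True`, and `random_state=42` are illustrative choices that are not part of the original answer. Sampling with replacement is what allows high-value rows to appear many times in the new dataset; without it each row can be picked at most once, so `n` cannot exceed the number of rows.

```python
import pandas as pd

# Rebuild the ten example rows from the question
df = pd.DataFrame({
    "id": range(10),
    "description": ["A"] * 10,
    "number": [1, 11, 1, 14, 1, 13, 1, 1, 13, 12],
    "value": [20, 50, 10, 1000, 20, 50, 800, 30, 5, 500],
})

# Weighted sampling with replacement: rows with larger "value" are drawn more often
sampled = df.sample(n=1000, weights="value", replace=True, random_state=42)

# Count how often each value ended up in the sample
counts = sampled["value"].value_counts()
print(counts)
```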

How do I sample a dataset based on column values?
