Question

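I have a table with >400,000 rows and ~200 columns. Each row has a column containing a position number that ranges from 0 to 140 and may be a decimal (e.g. 45.6345). I have binned the rows by position in increments of 5. My first bin contains all rows with a position in (0-5). My last bin contains the rows with a position in (135,140). To partition the data, I used the following code.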

#what is the maximum bin value. Add 1 in case the value is a decimal
maxposbin = max(ceiling(data$POS),na.rm=TRUE)+1
#what is the maximum position value
maxposvalue = max(data$POS, na.rm=TRUE)
#Assign the positions to a variable
posvalues = data$POS
#Cut the position values into bins by intervals of 5
posbin = cut(posvalues, breaks=seq(from=0,to=maxposbin, by=5))
#Make a frequency table to see how many rows are in each bin
posbinned = as.data.frame(table(posbin))
#Plot the frequency distribution
barplot(posbinned$Freq)

My posbinned table looks like this:

  posbin   Freq     binprob
1      (0,5]   8533 0.031925105
2     (5,10]   7318 0.037225597
3    (10,15]   9324 0.029216744
4    (15,20]  10576 0.025758029
5    (20,25]   7065 0.038558658
6    (25,30]   3178 0.085719609
7    (30,35]   5900 0.046172359
8    (35,40]   8132 0.033499375
9    (40,45]   8335 0.032683493
10   (45,50]  16409 0.016601677
11   (50,55]  20481 0.013300958
12   (55,60]  25978 0.010486447
13   (60,65] 161292 0.001688967
14   (65,70]  26063 0.010452247
15   (70,75]  11427 0.023839758
16   (75,80]  11232 0.024253643
17   (80,85]   5129 0.053113066
18   (85,90]  11180 0.024366451
19   (90,95]   4188 0.065047019
20  (95,100]   9871 0.027597702
21 (100,105]  13645 0.019964596
22 (105,110]  13294 0.020491719
23 (110,115]   8791 0.030988160
24 (115,120]   3583 0.076030398
25 (120,125]   4874 0.055891858
26 (125,130]   7304 0.037296949
27 (130,135]   2997 0.090896536
28 (135,140]   7376 0.036932879

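I want to select a defined number of rows from this dataset according to a probability assigned to each bin. The resulting sample should be spread evenly across the positions (0 to 140). For example, bin 13 has the highest number of rows, so it would be assigned the lowest probability of having a row selected from it. Bin 27 has the fewest rows and should have the highest selection probability. Each bin should be represented roughly equally to every other bin in the resulting sample. I have assigned each bin a probability, which is stored in the variable posbinned$binprob.

I calculated the bin probabilities relative to bin 27, which contains the fewest rows. For example, bin 7 has about twice as many rows as bin 27, so a row from it should be half as likely to be selected as a row from bin 27. I then rescaled them so that the 28 bin probabilities sum to 1. I'm a bit rusty on my probability and statistics, so perhaps this is not the right way to think about the problem?

How can I sample from "data" without replacement, using the probabilities defined by the bins in the "posbinned" table? At the moment I don't have a table that contains the positions together with their corresponding bin (e.g. (0,5)). I'm just not sure what the best way to do this is.

Thanks.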

Answer 1

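The first step is to identify the bin for each row in data. Since your bins start at (but exclude) 0 and go up in increments of 5, this can be done with simple arithmetic: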

bin.number <- ceiling(data$POS / 5)
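As a quick sanity check, the arithmetic can be compared against the integer codes of the factor that cut() returns. This is only a sketch: the vector pos below is made-up illustration data, not the table from the question.

```r
# Hypothetical positions for illustration only (not the original data)
pos <- c(0.5, 5, 5.01, 45.6345, 134.9, 135.2, 140)

# Bin index via arithmetic: (0,5] -> 1, (5,10] -> 2, ..., (135,140] -> 28
bin.number <- ceiling(pos / 5)

# Bin index via cut(): the integer codes of the factor levels
bin.from.cut <- as.integer(cut(pos, breaks = seq(from = 0, to = 140, by = 5)))

# Both approaches should agree for every position in (0, 140]
stopifnot(all(bin.number == bin.from.cut))
print(bin.number)
```

Note that a position of exactly 0 is a special case: ceiling(0 / 5) is 0, and cut() with its default arguments returns NA for it, so such rows would need to be handled separately before sampling.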

Next, you need to look up the bin frequency for each row:

bin.freq <- posbinned$Freq[bin.number]
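Indexing posbinned$Freq by bin.number repeats each bin's count once per row, so the result has one entry per row of data. A toy sketch of that lookup, assuming six made-up positions that fall into three bins:

```r
# Hypothetical toy data: six positions falling into three bins
pos <- c(1, 2, 3, 7, 8, 12)

# Frequency table built the same way as in the question
posbin <- cut(pos, breaks = seq(from = 0, to = 15, by = 5))
posbinned <- as.data.frame(table(posbin))

# Per-row bin index, then per-row bin frequency via vector indexing
bin.number <- ceiling(pos / 5)
bin.freq <- posbinned$Freq[bin.number]

# One frequency per row of the original data
stopifnot(length(bin.freq) == length(pos))
print(bin.freq)
```

The lookup relies on the rows of posbinned being in bin order, which is the case when the table comes straight from table() on the output of cut().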

Then you need to sample without replacement, with probability proportional to 1 divided by the bin frequency:

num.to.sample <- 100    # Select the number of samples you want
rows <- sample(1:nrow(data), size=num.to.sample, replace=FALSE, prob=1/bin.freq)
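The weights work because every row in a bin gets weight 1/frequency, so each bin's weights add up to 1 and all bins start out equally likely to be drawn from. The sketch below puts the three steps together on simulated data and counts how many sampled rows land in each bin. The simulated positions, the seed, and the use of tabulate() in place of posbinned$Freq are assumptions made for the illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in for the real table: positions in (0, 140), with a
# heavy concentration in (60, 65) to mimic the skew shown in the question
pos <- c(runif(20000, min = 0, max = 140), runif(30000, min = 60, max = 65))
data <- data.frame(POS = pos)

# Step 1: bin index per row
bin.number <- ceiling(data$POS / 5)

# Step 2: rows per bin (plays the role of posbinned$Freq), then per-row lookup
bin.count <- tabulate(bin.number, nbins = 28)
bin.freq <- bin.count[bin.number]

# Step 3: weighted sample of row indices without replacement
num.to.sample <- 1000
rows <- sample(seq_len(nrow(data)), size = num.to.sample,
               replace = FALSE, prob = 1 / bin.freq)
sampled <- data[rows, , drop = FALSE]

# How many sampled rows fall into each of the 28 bins
sampled.per.bin <- tabulate(ceiling(sampled$POS / 5), nbins = 28)
print(sampled.per.bin)
```

The per-bin counts should come out roughly equal rather than dominated by bin 13. Because sampling is without replacement, the balance is only approximate, and it degrades if num.to.sample approaches the number of rows in the smallest bins.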
