Include missing values after random sampling in R (merge vectors and include missing values as 0)

Asked: 2014-09-19 21:10:15

Tags: r

I am trying to run many random sampling trials, and in any given trial I may not draw every item.

Right now, what I have is:

test <- sample(rownames(data), size=10000, replace=T, prob=data$refFraction)

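Not every value of rownames(data) appears in the result, but I need all of them for the next step.

I would like every sample to give a vector of the same length (and in the same order), so that I can combine the samples into a matrix. I am also not sure how best to do that: how do I generate thousands of test vectors and merge them with one of the apply functions?

Edit: based on the answers, I came up with this: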

trials <- function(fractions, kmers, times, ref_size) {
    replicate(times, sample(kmers, size=ref_size, replace=T, prob=fractions), simplify=F)
}

result <- trials(data$refFraction, rownames(data), 100, 1000)
mat <- matrix(result, nrow=100)

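But what I still want is just the number of times each item occurs in each row, including a 0 for items that were not drawn, so that I end up with a rectangular matrix.

The desired result looks like this: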

         "A" "B" "C"
Trial1    2   5   6
Trial2    3   7  12
Trial3    0   5  14

dput(head(data)):

structure(list(refCount = c(3142L, 4102L, 1975L, 2009L, 2363L, 
2437L), refFraction = c(0.00300290255094, 0.00392040301208, 0.00188756605287, 
0.00192006086086, 0.00225838915591, 0.00232911314979), readCount = c(147L, 
719L, 356L, 418L, 745L, 766L), readFraction = c(0.00029577107721, 
0.00144666261574, 0.000716289139367, 0.000841036124312, 0.00149897586749, 
0.00154122887852), foldChange = c(2.31774884958, 0.996935198459, 
0.968959564031, 0.825477549838, 0.409869676355, 0.412907501432
), p_value = c(5.05923221341436e-321, 4.46023836252119e-170, 
2.29230878162415e-77, 1.73499617494115e-59, 2.80547347576314e-15, 
4.32620038741552e-16)), .Names = c("refCount", "refFraction", 
"readCount", "readFraction", "foldChange", "p_value"), row.names = c("AAAAA", 
"AAAAT", "AAAAG", "AAAAC", "AAATA", "AAATT"), class = "data.frame")

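2 answers:

Answer 0 (score: 1)

It is not entirely clear what you are trying to do, but this might help.

replicate is well suited to repeated sampling. Here I create a 5-row data frame d and then sample its row names ten separate times. Used this way, replicate returns a matrix with one column per repetition, so it sounds like this may be the approach you need.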

> d <- data.frame(x = 1:5, y = 6:10)
> replicate(10, sample(rownames(d)))
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] "5"  "1"  "1"  "3"  "4"  "1"  "4"  "5"  "3"  "1"  
# [2,] "4"  "5"  "2"  "2"  "3"  "5"  "1"  "2"  "1"  "2"  
# [3,] "1"  "4"  "5"  "5"  "5"  "4"  "3"  "3"  "2"  "3"  
# [4,] "2"  "3"  "3"  "1"  "1"  "2"  "2"  "4"  "4"  "5"  
# [5,] "3"  "2"  "4"  "4"  "2"  "3"  "5"  "1"  "5"  "4" 
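As a small sketch of the point above (not part of the original answer): replicate only returns a matrix when its simplify argument is left at the default. With simplify = FALSE, as in the question's edit, it returns a list, which can be bound into a matrix afterwards. The names m, l and m2 are illustrative only.

```r
d <- data.frame(x = 1:5, y = 6:10)

# Default simplify = TRUE: one column per repetition.
m <- replicate(10, sample(rownames(d)))

# simplify = FALSE: a plain list of 10 character vectors.
l <- replicate(10, sample(rownames(d)), simplify = FALSE)

# Bind the list into a matrix with one row per repetition.
m2 <- do.call(rbind, l)

dim(m)   # 5 rows, 10 columns
dim(m2)  # 10 rows, 5 columns
```

Whether trials end up as rows or columns is only a matter of orientation; t() switches between the two.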

Answer 1 (score: 0)

This is how I ended up doing it:

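# num_trials and trial_size are read from the global environment and must be
# defined before trial_fn is called.
trial_fn <- function(counts) {
   replicate(num_trials, sample(counts, size=trial_size, replace=FALSE), simplify=FALSE)
}

# Count how often each of the 1024 row indices occurs in x (0 for indices
# that were not drawn) and convert the counts to fractions.
tableize <- function(x) {
    tmp <- matrix(table(factor(x, levels=1:1024)))[,1]
    tmp/sum(tmp)
}

# Build a pool in which row index i is repeated readCount[i] times.
counts <- vector()
for (i in 1:1024) {
    counts <- c(counts, rep(i, times=data[i,]$readCount))
}

trials <- trial_fn(counts)
# One column per trial, one row per index.
trial_table <- sapply(trials, tableize)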

Using factor together with levels, and then calling table on the result, was the answer to the original question.
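A minimal sketch of that idea, applied to the count matrix the question asks for. The items A, B, C and their weights are hypothetical stand-ins for rownames(data) and data$refFraction, and count_trial is an illustrative helper name, not something from the original posts.

```r
set.seed(1)  # only to make the sketch reproducible

# Hypothetical stand-ins for rownames(data) and data$refFraction.
items <- c("A", "B", "C")
weights <- c(0.05, 0.35, 0.60)

# Run one trial and count the draws. factor(levels = items) keeps a slot for
# every item, so items that were never drawn are counted as 0 rather than dropped.
count_trial <- function(n_draws) {
    drawn <- sample(items, size = n_draws, replace = TRUE, prob = weights)
    as.vector(table(factor(drawn, levels = items)))
}

n_trials <- 3
# replicate returns one column per trial; transpose to get one row per trial.
counts <- t(replicate(n_trials, count_trial(20)))
dimnames(counts) <- list(paste0("Trial", seq_len(n_trials)), items)
counts
```

Every row has the same length and column order regardless of which items were drawn, so the trials stack into a rectangular matrix.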