How do I bin data by the values in one column and count occurrences in another column, excluding duplicates, in R?

Asked: 2015-08-24 20:34:01

Tags: r count bins

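I have a file of correlation r-values. I want to split the r-values into bins and count how many CNVs fall in each bin. Is there a way to do this without counting the same CNV more than once?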

GeneChr   SNP   SNP_Position          CNV           start       end         r-value
1   rs7520551   100716167   1:101161140-101161459   100161140   102161459   0.950231679
1   rs6702766   100997635   1:101161140-101161459   100161140   102161459   0.376573375
1   rs11588568  101426960   1:101161140-101161459   100161140   102161459   0.252772248
1   rs4332900   10236894    1:10405137-10406094     9405137     11406094    0.171113128
1   rs11678947  10307395    1:10405137-10406094     9405137     11406094    0.334359684
1   rs2357468   10341468    1:10405137-10406094     9405137     11406094    0.30932652
1   rs1918705   10693478    1:10405137-10406094     9405137     11406094    0.822784876
1   rs7570190   101528047   1:101161140-101161459   100161140   102161459   0.391963719
1   rs643841    110832827   1:110028467-110029625   109028467   111029625   0.070643341
1   rs7514102   110998854   1:110028467-110029625   109028467   111029625   0.548219745
1   rs4676225   109609765   1:110028467-110029625   109028467   111029625   0.035118621
1   rs7608232   101699063   1:101161140-101161459   100161140   102161459   0.951958567
1   rs1449308   100708996   1:101161140-101161459   100161140   102161459   0.703308687

I have this line to bin the data; I just need it to count CNVs without double counting.

xNew <- table(cut(CorTestMatrix$test, breaks=c(0,0.1,0.2, 0.3, 0.4, 0.5,1)))

I only want to know how many CNVs are in each bin.

2 Answers:

Answer 0: (score: 2)

Would this work?

df <- data.frame(CNV=c("1:10405137","1:10405137","1:10405137","1:101161140","1:110028467")
     ,r_value=c(0.035118621,0.070643341,0.391963719,0.376573375,0.950231679))

> df # minimal example
          CNV    r_value
1  1:10405137 0.03511862
2  1:10405137 0.07064334
3  1:10405137 0.39196372
4 1:101161140 0.37657337
5 1:110028467 0.95023168

df1 <- transform(df, group=cut(r_value, 
                        breaks=c(0,0.1,0.2, 0.3, 0.4, 0.5,1),
                        labels=c("<0.1","0.1","0.2", "0.3", "0.4", "0.5<")))

res <- do.call(data.frame,aggregate(r_value~group, df1, 
                                    FUN=function(x) c(Count=length(x))))

> res # counts of intervals
  group r_value
1  <0.1       2
2   0.3       2
3  0.5<       1

dNew <- data.frame(group=levels(df1$group))
dNew <- merge(res, dNew, all=TRUE)
colnames(dNew) <- c("interval","count")

> dNew # count of CNV by interval
  interval count
1     <0.1     2
2      0.1    NA
3      0.2    NA
4      0.3     2
5      0.4    NA
6     0.5<     1

Adapted from "Group/bin/bucket data in R and get count per bucket and sum of values per bucket".
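The aggregate() call above counts rows per interval, so a CNV that appears twice in the same interval is counted twice, and empty intervals come back as NA after the merge. If the goal is one count per distinct CNV in each bin, one possible base R sketch is shown below. It reuses the minimal example data frame from this answer; the names `bin`, `unique_pairs`, and `distinct_counts` are made up for illustration, and it assumes a CNV falling into two different bins should be counted once in each.

```r
# Same minimal example data as in the answer above
df <- data.frame(
  CNV = c("1:10405137", "1:10405137", "1:10405137", "1:101161140", "1:110028467"),
  r_value = c(0.035118621, 0.070643341, 0.391963719, 0.376573375, 0.950231679)
)

# Bin the r-values using the break points from the question
df$bin <- cut(df$r_value, breaks = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 1))

# Drop repeated (CNV, bin) pairs so each CNV is counted once per bin
unique_pairs <- unique(df[, c("CNV", "bin")])

# table() on a factor reports every level, so empty bins show up as 0
distinct_counts <- table(unique_pairs$bin)
print(distinct_counts)
```

In this example the first bin drops from 2 rows to 1 distinct CNV, because "1:10405137" occurs twice with r-values below 0.1.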

Answer 1: (score: 0)

Here is a dplyr approach. (Note that counting distinct CNVs instead of rows is only a small change.)

library(dplyr)

df %>% mutate(binned_r_value = cut(r_value, breaks=c(0,0.1,0.2,0.3,0.4,0.5,1))) %>%
  group_by(binned_r_value) %>%
  tally()

# A tibble: 3 x 2
  binned_r_value     n
  <fct>          <int>
1 (0,0.1]            2
2 (0.3,0.4]          2
3 (0.5,1]            1
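The "small change" mentioned in this answer is not spelled out. One way it might look is to replace tally() with summarise() and dplyr's n_distinct(), as in the sketch below. The data frame is the minimal example from the first answer, and the names `distinct_by_bin` and `n_cnv` are hypothetical.

```r
library(dplyr)

# Minimal example data from the first answer
df <- data.frame(
  CNV = c("1:10405137", "1:10405137", "1:10405137", "1:101161140", "1:110028467"),
  r_value = c(0.035118621, 0.070643341, 0.391963719, 0.376573375, 0.950231679)
)

distinct_by_bin <- df %>%
  # Bin the r-values with the break points from the question
  mutate(binned_r_value = cut(r_value, breaks = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 1))) %>%
  group_by(binned_r_value) %>%
  # n_distinct() counts each CNV once per bin instead of once per row
  summarise(n_cnv = n_distinct(CNV))

print(distinct_by_bin)
```

As with tally(), bins that contain no rows are dropped from the result by default; passing `.drop = FALSE` to group_by() should keep them with a count of 0.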