How to find the proportion of rows in a table that contain a specific element

Asked: 2015-01-28 09:45:39

Tags: r

My data frame looks like this:

     region          LINE   
chr1-810865-3198369  L1MC4a  
chr1-810865-3198369  L1E33  
chr1-810865-3198369  L1MB5  
chr1-810865-3198369  L1MEc  
chr1-810865-3198369  L2a  
chr1-810865-3198369  L1M5  
chr2-100655-1344334  L1M5  
chr2-100655-1344334  L1E33  
etc.

I want to see what share of the UNIQUE regions defined by $start to $end contain each LINE listed in $LINE. I would like output along these lines:

%OfAllRegions   LINE

 75%            L1M5
 53%            L1E33 etc.

1 answer:

Answer 0 (score: 2)

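Your question isn't very clear, because the data set you provided contains many other variables that seem irrelevant, but it looks like you are after something like the following: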

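library(data.table)
# For each (start, end) region: tabulate LINE, convert the counts to rounded
# percentages, and spread them into one column per LINE level
(Res <- setDT(df)[, as.list(round(prop.table(table(LINE)) * 100)), .(start, end)])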
#     start     end L1E33 L1M5 L1MB5 L1MC4a L1MEc L2a
# 1: 810865 3198369    17   17    17     17    17  17
# 2: 100655 1344334    50   50     0      0     0   0
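
For readers who want to see what that one-liner computes, here is a minimal base R sketch of the same within-region percentages. The small `df` below is a stand-in that keeps only the `start`, `end`, and `LINE` columns of the sample data, and the names `region`, `counts`, and `pct` are illustrative, not part of the original answer.

```r
# Stand-in for the sample data (only the columns used here)
df <- data.frame(
  start = c(rep(810865L, 6), rep(100655L, 2)),
  end   = c(rep(3198369L, 6), rep(1344334L, 2)),
  LINE  = c("L1MC4a", "L1E33", "L1MB5", "L1MEc", "L2a", "L1M5", "L1M5", "L1E33")
)

# One label per region, then a region-by-LINE contingency table
region <- paste(df$start, df$end, sep = "-")
counts <- table(region, df$LINE)

# margin = 1 divides each row by its row total, i.e. shares within a region
pct <- round(prop.table(counts, margin = 1) * 100)
print(pct)
```

Each row of `pct` sums to roughly 100 (up to rounding), so these numbers describe the composition of a single region, not how many regions contain a given LINE.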

If you want to add the percent sign, you can simply do:

# Append "%" to every column except start and end (they become character columns)
Res[, names(Res)[-(1:2)] := lapply(.SD, paste0, "%"), .SDcols = -c("start", "end")][]
#     start     end L1E33 L1M5 L1MB5 L1MC4a L1MEc L2a
# 1: 810865 3198369   17%  17%   17%    17%   17% 17%
# 2: 100655 1344334   50%  50%    0%     0%    0%  0%
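
The question can also be read differently: for each LINE, what percentage of all unique regions contain it at least once? Under that reading (an assumption, since the question is ambiguous), a base R sketch might look like this. The `df` below is again a stand-in for the sample data, and `pairs`, `n_regions`, and `res` are illustrative names.

```r
# Stand-in for the sample data (only the columns used here)
df <- data.frame(
  start = c(rep(810865L, 6), rep(100655L, 2)),
  end   = c(rep(3198369L, 6), rep(1344334L, 2)),
  LINE  = c("L1MC4a", "L1E33", "L1MB5", "L1MEc", "L2a", "L1M5", "L1M5", "L1E33")
)

# Count each (region, LINE) combination once, even if it repeats within a region
pairs <- unique(df[, c("start", "end", "LINE")])

# Total number of distinct regions
n_regions <- nrow(unique(df[, c("start", "end")]))

# Number of distinct regions containing each LINE, as a percentage of all regions
counts <- table(pairs$LINE)
pct <- 100 * as.vector(counts) / n_regions

res <- data.frame(
  PctOfAllRegions = paste0(round(pct), "%"),
  LINE = names(counts),
  stringsAsFactors = FALSE
)
res <- res[order(-pct), ]  # most widespread LINEs first
print(res, row.names = FALSE)
```

On the two-region sample, L1M5 and L1E33 occur in both regions and come out at 100%, while the other LINEs occur in one region and come out at 50%. The 75% and 53% figures in the question would need the full data set.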

Data

df <- structure(list(chr = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L), .Label = c("chr1", "chr2"), class = "factor"), start = c(810865L, 
810865L, 810865L, 810865L, 810865L, 810865L, 100655L, 100655L
), end = c(3198369L, 3198369L, 3198369L, 3198369L, 3198369L, 
3198369L, 1344334L, 1344334L), chr2 = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L), .Label = "chr1", class = "factor"), start2 = c(814631L, 
818064L, 840645L, 849835L, 892914L, 918475L, 106773L, 107999L
), end2 = c(823247L, 822563L, 841179L, 850777L, 894175L, 919243L, 
107889L, 109923L), LINE = structure(c(4L, 1L, 3L, 5L, 6L, 2L, 
2L, 1L), .Label = c("L1E33", "L1M5", "L1MB5", "L1MC4a", "L1MEc", 
"L2a"), class = "factor"), d = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L)), .Names = c("chr", "start", "end", "chr2", "start2", "end2", 
"LINE", "d"), class = "data.frame", row.names = c(NA, -8L))