My data frame looks like this:
hsa-let-7a-3p 45
hsa-let-7a-5p 1148
hsa-let-7b-3p 8
hsa-let-7b-5p 184
hsa-let-7c-3p 1
hsa-let-7c-5p 258
hsa-let-7d-5p 343
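I want to count how many identifiers have both a 3p and a 5p row, how many have only 3p, and how many have only 5p. For example, hsa-let-7a, hsa-let-7b and hsa-let-7c each have both 3p and 5p, whereas hsa-let-7d has only 5p. I don't care about the numbers in the second column. I would prefer a grep-based solution, but R would be fine too.
Desired output: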
Both 3p and 5p: 3
Only 3p: 0
Only 5p: 1
My attempt in R:
> head(Meister_Ago1,20)
V1 V2
1 hsa-let-7a-2-3p 1
2 hsa-let-7a-3p 45
3 hsa-let-7a-5p 1148
4 hsa-let-7b-3p 8
5 hsa-let-7b-5p 184
6 hsa-let-7c-3p 1
7 hsa-let-7c-5p 258
8 hsa-let-7d-3p 22
9 hsa-let-7d-5p 142
10 hsa-let-7e-3p 1
11 hsa-let-7e-5p 114
12 hsa-let-7f-1-3p 1
13 hsa-let-7f-2-3p 10
14 hsa-let-7f-5p 794
15 hsa-let-7g-3p 2
16 hsa-let-7g-5p 94
17 hsa-let-7i-3p 2
18 hsa-let-7i-5p 97
19 hsa-miR-1-3p 4
20 hsa-miR-1-5p 0
Answer 0 (score: 2)
Perhaps:
grp <- sub('-..$', '', df$Col1)
val <- sub('.*(..)$', '\\1', df$Col1)
tbl <- table(grp, val)
sum(rowSums(tbl)==2)
#[1] 3
Or, counting each category separately (column 1 of tbl is 3p, column 2 is 5p):
sum(tbl[,1] & tbl[,2])
#[1] 3
sum(tbl[,1]==0 & tbl[,2]!=0)
#[1] 1
sum(tbl[,1]!=0 & tbl[,2]==0)
#[1] 0
Based on the updated data "Meister_Ago1":
grp <- sub('-..$', '', Meister_Ago1$V1)
val <- sub('.*(..)$', '\\1', Meister_Ago1$V1)
tbl <- table(grp, val)
sum(tbl[,1] & tbl[,2])
#[1] 8
sum(tbl[,1]==0 & tbl[,2]!=0)
#[1] 1
sum(tbl[,1]!=0 & tbl[,2]==0)
#[1] 3
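To see where the 8 / 1 / 3 counts come from, it can help to list the group each row ends up in. The following is a rough shell/awk sketch rather than part of the R answer: it applies the same idea of stripping the trailing arm suffix, then labels each group. The variable names `ids` and `labels` and the labels `both`, `only_3p`, `only_5p` are made up for this illustration, and the identifiers are the first column of the 20 rows shown in the question.

```shell
# First column of the 20-row Meister_Ago1 sample from the question.
ids='hsa-let-7a-2-3p
hsa-let-7a-3p
hsa-let-7a-5p
hsa-let-7b-3p
hsa-let-7b-5p
hsa-let-7c-3p
hsa-let-7c-5p
hsa-let-7d-3p
hsa-let-7d-5p
hsa-let-7e-3p
hsa-let-7e-5p
hsa-let-7f-1-3p
hsa-let-7f-2-3p
hsa-let-7f-5p
hsa-let-7g-3p
hsa-let-7g-5p
hsa-let-7i-3p
hsa-let-7i-5p
hsa-miR-1-3p
hsa-miR-1-5p'

# Strip the trailing -3p/-5p to get the group, remember which arms were seen,
# then print one "label group" line per group.
labels=$(printf '%s\n' "$ids" | awk '
{
    grp = $1
    sub(/-[35]p$/, "", grp)
    seen[grp] = 1
    if ($1 ~ /-3p$/) has3[grp] = 1
    if ($1 ~ /-5p$/) has5[grp] = 1
}
END {
    for (g in seen) {
        if ((g in has3) && (g in has5)) { label = "both" }
        else if (g in has3)             { label = "only_3p" }
        else if (g in has5)             { label = "only_5p" }
        else                            { label = "neither" }
        print label, g
    }
}' | sort)

printf '%s\n' "$labels"
```

On this sample the three 3p-only groups are hsa-let-7a-2, hsa-let-7f-1 and hsa-let-7f-2, and the single 5p-only group is hsa-let-7f. In other words, an identifier such as hsa-let-7f-1-3p is grouped as hsa-let-7f-1, not as hsa-let-7f. Whether that is the intended grouping depends on how the numbered variants should be treated.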
Data:
df <- structure(list(Col1 = c("hsa-let-7a-3p", "hsa-let-7a-5p",
"hsa-let-7b-3p", "hsa-let-7b-5p", "hsa-let-7c-3p", "hsa-let-7c-5p",
"hsa-let-7d-5p"), Col2 = c(45L, 1148L, 8L, 184L, 1L, 258L, 343L)),
.Names = c("Col1", "Col2"), class = "data.frame", row.names = c(NA,
-7L))
Meister_Ago1 <- structure(list(V1 = c("hsa-let-7a-2-3p", "hsa-let-7a-3p",
"hsa-let-7a-5p", "hsa-let-7b-3p", "hsa-let-7b-5p", "hsa-let-7c-3p",
"hsa-let-7c-5p", "hsa-let-7d-3p", "hsa-let-7d-5p", "hsa-let-7e-3p",
"hsa-let-7e-5p",
"hsa-let-7f-1-3p", "hsa-let-7f-2-3p", "hsa-let-7f-5p", "hsa-let-7g-3p",
"hsa-let-7g-5p", "hsa-let-7i-3p", "hsa-let-7i-5p", "hsa-miR-1-3p",
"hsa-miR-1-5p"), V2 = c(1L, 45L, 1148L, 8L, 184L, 1L, 258L, 22L,
142L, 1L, 114L, 1L, 10L, 794L, 2L, 94L, 2L, 97L, 4L, 0L)),
.Names = c("V1", "V2"), class = "data.frame", row.names =
c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",
"13", "14", "15", "16", "17", "18", "19", "20"))
Answer 1 (score: 0)
This awk code should do it:
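awk '{s=h=$1; sub(/-.p$/,"",h); all[h]}   # h = identifier without the -3p/-5p suffix
s~/-3p$/{a[h]} s~/-5p$/{b[h]}             # a: identifiers with 3p, b: identifiers with 5p
END{ for(x in all)
       if(x in b && x in a){              # seen with both arms
         ca++
         delete b[x]
         delete a[x]
       }
     # whatever is left in a or b was seen with one arm only
     printf "Both 3p and 5p:%d\n", ca
     printf "Only 3p :%d\n", length(a)
     printf "Only 5p :%d\n", length(b)
}' file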
Output:
Both 3p and 5p:3
Only 3p :0
Only 5p :1
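Since the question asks for something grep-based, the same counts can also be sketched with grep, sed, sort and comm. This is only one possible pipeline and not part of the answer above. The file name `mirna.txt` and the temporary file names are made up for the example, and it assumes the identifier is the first whitespace-separated field on each line.

```shell
# Hypothetical input file holding the seven sample rows from the question.
tmp=$(mktemp -d)
cat > "$tmp/mirna.txt" <<'EOF'
hsa-let-7a-3p 45
hsa-let-7a-5p 1148
hsa-let-7b-3p 8
hsa-let-7b-5p 184
hsa-let-7c-3p 1
hsa-let-7c-5p 258
hsa-let-7d-5p 343
EOF

# Identifiers that have a 3p row: keep column 1, select the -3p names,
# drop the suffix, then sort and de-duplicate.
awk '{print $1}' "$tmp/mirna.txt" | grep -e '-3p$' | sed 's/-3p$//' | sort -u > "$tmp/ids_3p"
# Same for the 5p rows.
awk '{print $1}' "$tmp/mirna.txt" | grep -e '-5p$' | sed 's/-5p$//' | sort -u > "$tmp/ids_5p"

# comm compares two sorted lists: -12 keeps lines common to both,
# -23 keeps lines only in the first, -13 keeps lines only in the second.
both=$(comm -12 "$tmp/ids_3p" "$tmp/ids_5p" | wc -l)
only3=$(comm -23 "$tmp/ids_3p" "$tmp/ids_5p" | wc -l)
only5=$(comm -13 "$tmp/ids_3p" "$tmp/ids_5p" | wc -l)

printf 'Both 3p and 5p: %d\n' "$both"
printf 'Only 3p: %d\n' "$only3"
printf 'Only 5p: %d\n' "$only5"

rm -r "$tmp"
```

Both lists come out of `sort -u`, which gives `comm` the sorted input it requires. The awk version above makes a single pass over the file, while this one reads it twice in exchange for using only simple line-oriented tools.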