Grep rows based on identifier

Date: 2015-01-20 15:45:15

Tags: r awk grep

My data frame looks like this:

hsa-let-7a-3p   45
hsa-let-7a-5p   1148
hsa-let-7b-3p   8
hsa-let-7b-5p   184
hsa-let-7c-3p   1
hsa-let-7c-5p   258
hsa-let-7d-5p   343

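I want to count how many identifiers have both a 3p and a 5p row, how many have only 3p, and how many have only 5p. For example, hsa-let-7a, hsa-let-7b, and hsa-let-7c each have both 3p and 5p, whereas hsa-let-7d has only 5p. I don't care about the numbers in the second column. I would prefer a grep-based solution, but R would be fine too.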

Expected output:

Both 3p and 5p: 3
Only 3p: 0
Only 5p: 1

My attempt in R (the first 20 rows of the real data):

> head(Meister_Ago1,20)


             V1   V2
1  hsa-let-7a-2-3p    1
2    hsa-let-7a-3p   45
3    hsa-let-7a-5p 1148
4    hsa-let-7b-3p    8
5    hsa-let-7b-5p  184
6    hsa-let-7c-3p    1
7    hsa-let-7c-5p  258
8    hsa-let-7d-3p   22
9    hsa-let-7d-5p  142
10   hsa-let-7e-3p    1
11   hsa-let-7e-5p  114
12 hsa-let-7f-1-3p    1
13 hsa-let-7f-2-3p   10
14   hsa-let-7f-5p  794
15   hsa-let-7g-3p    2
16   hsa-let-7g-5p   94
17   hsa-let-7i-3p    2
18   hsa-let-7i-5p   97
19    hsa-miR-1-3p    4
20    hsa-miR-1-5p    0

2 answers:

Answer 0 (score: 2)

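Perhaps split each ID into its identifier and its arm (3p or 5p), cross-tabulate the two, and count the identifiers that have both arms: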

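grp <- sub('-..$', '', df$Col1)        # identifier without the trailing -3p/-5p
val <- sub('.*(..)$', '\\1', df$Col1)  # arm: the last two characters ("3p" or "5p")
tbl <- table(grp, val)                 # rows: identifiers; columns: "3p", "5p"
sum(rowSums(tbl)==2)                   # identifiers with both arms (each ID occurs once)
#[1] 3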

Or, to get all three counts:

sum(tbl[,1] & tbl[,2])          # both 3p and 5p
#[1] 3
sum(tbl[,1]==0 & tbl[,2]!=0)    # only 5p
#[1] 1
sum(tbl[,1]!=0 & tbl[,2]==0)    # only 3p
#[1] 0

Update

Based on the updated data, "Meister_Ago1":

grp <- sub('-..$', '', Meister_Ago1$V1)
val <- sub('.*(..)$', '\\1', Meister_Ago1$V1)
tbl <- table(grp, val)

sum(tbl[,1] & tbl[,2])          # both 3p and 5p
#[1] 8
sum(tbl[,1]==0 & tbl[,2]!=0)    # only 5p
#[1] 1
sum(tbl[,1]!=0 & tbl[,2]==0)    # only 3p
#[1] 3

Data

df <- structure(list(Col1 = c("hsa-let-7a-3p", "hsa-let-7a-5p",
"hsa-let-7b-3p", "hsa-let-7b-5p", "hsa-let-7c-3p", "hsa-let-7c-5p", 
"hsa-let-7d-5p"), Col2 = c(45L, 1148L, 8L, 184L, 1L, 258L, 343L)), 
.Names = c("Col1", "Col2"), class = "data.frame", row.names = c(NA, 
-7L))


Meister_Ago1 <- structure(list(V1 = c("hsa-let-7a-2-3p", "hsa-let-7a-3p", 
 "hsa-let-7a-5p", "hsa-let-7b-3p", "hsa-let-7b-5p", "hsa-let-7c-3p", 
 "hsa-let-7c-5p", "hsa-let-7d-3p", "hsa-let-7d-5p", "hsa-let-7e-3p", 
 "hsa-let-7e-5p", 
 "hsa-let-7f-1-3p", "hsa-let-7f-2-3p", "hsa-let-7f-5p", "hsa-let-7g-3p", 
 "hsa-let-7g-5p", "hsa-let-7i-3p", "hsa-let-7i-5p", "hsa-miR-1-3p", 
 "hsa-miR-1-5p"), V2 = c(1L, 45L, 1148L, 8L, 184L, 1L, 258L, 22L, 
  142L, 1L, 114L, 1L, 10L, 794L, 2L, 94L, 2L, 97L, 4L, 0L)), 
 .Names = c("V1", "V2"), class = "data.frame", row.names = 
 c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", 
 "13", "14", "15", "16", "17", "18", "19", "20"))

Answer 1 (score: 0)

This awk code should do it:

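awk '{ s = h = $1; sub(/-.p$/, "", h); all[h] }   # h: ID without the -3p/-5p suffix
     s ~ /-3p$/ { a[h] }                           # identifiers that have a 3p row
     s ~ /-5p$/ { b[h] }                           # identifiers that have a 5p row
     END {
         for (x in all)
             if (x in b && x in a) {               # both arms present
                 ca++
                 delete b[x]
                 delete a[x]
             }
         # whatever is left in a or b has only one arm
         printf "Both 3p and 5p:%d\n", ca
         printf "Only 3p :%d\n", length(a)
         printf "Only 5p :%d\n", length(b)
     }' file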

Output:

Both 3p and 5p:3
Only 3p :0
Only 5p :1
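
Since the question asks for a grep-based approach, here is a hedged sketch of one way to get the same counts with grep, sed, sort, and comm. It is not taken from either answer. It assumes whitespace-separated input with the ID in the first column and each ID ending in -3p or -5p; the scratch directory and the file names mirna.txt, ids_3p, and ids_5p are made up for the example.

```shell
#!/bin/sh
# Hypothetical scratch directory and file names, used only for this sketch.
tmp=$(mktemp -d)
cat > "$tmp/mirna.txt" <<'EOF'
hsa-let-7a-3p   45
hsa-let-7a-5p   1148
hsa-let-7b-3p   8
hsa-let-7b-5p   184
hsa-let-7c-3p   1
hsa-let-7c-5p   258
hsa-let-7d-5p   343
EOF

# comm needs sorted input; a fixed locale keeps sort and comm consistent.
export LC_ALL=C

# Identifiers that have a 3p row, and identifiers that have a 5p row,
# each with the arm suffix removed.
awk '{ print $1 }' "$tmp/mirna.txt" | grep -e '-3p$' | sed 's/-3p$//' | sort -u > "$tmp/ids_3p"
awk '{ print $1 }' "$tmp/mirna.txt" | grep -e '-5p$' | sed 's/-5p$//' | sort -u > "$tmp/ids_5p"

# comm -12: lines in both files; -23: only in the first; -13: only in the second.
both=$(comm -12 "$tmp/ids_3p" "$tmp/ids_5p" | wc -l | tr -d ' ')
only3=$(comm -23 "$tmp/ids_3p" "$tmp/ids_5p" | wc -l | tr -d ' ')
only5=$(comm -13 "$tmp/ids_3p" "$tmp/ids_5p" | wc -l | tr -d ' ')

printf 'Both 3p and 5p: %s\n' "$both"
printf 'Only 3p: %s\n' "$only3"
printf 'Only 5p: %s\n' "$only5"

rm -rf "$tmp"
```

On the seven-row sample from the question this prints 3, 0, and 1, matching the expected output. The single awk pass above avoids the temporary files, so this variant is mainly useful when a pipeline of standard tools is preferred.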