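I'm new to text analysis and R coding.
I have 200 gene names made up of mixed strings. I want to split them and put the plain words (e.g., cadherins, orphan receptors) in one column, and the numbers (e.g., 2/3) and number+letter tokens (e.g., 7D, 7TM) in another column. I used strsplit to split the names into words. Any suggestions on how to parse them would be helpful.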
example:
> Genes <- c("7D cadherins", "7TM orphan receptors", "7TM orphan receptors RNA18S", "28S ribosomal RNAs RNA28S", "45S pre-ribosomal RNAs RNA45S", "5.8S ribosomal RNAs", "Actin related protein 2/3 complex")
Expected result (2nd and 3rd columns):
7D cadherins cadherins 7D
7TM orphan receptors orphan receptors 7TM
18S ribosomal RNAs RNA18S ribosomal RNAs RNA18S 18S RNA18S
28S ribosomal RNAs RNA28S ribosomal RNAs RNA28S 28S RNA28S
45S pre-ribosomal RNAs RNA45S pre-ribosomal RNAs 45S RNA45S
5.8S ribosomal RNAs ribosomal RNAs 5.8S
Actin related protein 2/3 complex Actin related protein complex 2/3
Answer 0 (score: 0)
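Use strsplit to split the names, grep to detect the words with or without digits, and paste to collapse the selected words back together. Put everything in a function to avoid repetition: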
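wordS <- function(x, invert = TRUE) {
  # invert = TRUE keeps words without digits; invert = FALSE keeps words with digits
  clean <- gsub('[[:space:]]+', ' ', x)   # collapse runs of whitespace into one space
  split <- strsplit(clean, ' ')           # one character vector of words per name
  detec <- lapply(split, function(y) grep('[0-9]', y, invert = invert, value = TRUE))
  words <- sapply(detec, paste, collapse = ' ')   # rejoin the selected words
  return(words)
}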
data.frame(
  Gene = Genes,
  column2 = wordS(Genes),
  column3 = wordS(Genes, invert = FALSE)
)
                               Gene                       column2    column3
1                      7D cadherins                     cadherins         7D
2              7TM orphan receptors              orphan receptors        7TM
3       7TM orphan receptors RNA18S              orphan receptors 7TM RNA18S
4         28S ribosomal RNAs RNA28S                ribosomal RNAs 28S RNA28S
5     45S pre-ribosomal RNAs RNA45S            pre-ribosomal RNAs 45S RNA45S
6               5.8S ribosomal RNAs                ribosomal RNAs       5.8S
7 Actin related protein 2/3 complex Actin related protein complex        2/3
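To see what each stage of the function does, here is a minimal sketch that traces the same split / detect / collapse steps on a single gene name. The variable names (gene, tokens, has_digit, text_part, number_part) are made up for this illustration, the doubled space in the input is added on purpose to show the cleanup step, and grepl is used in place of grep only because a logical mask is easier to inspect:

```r
# Trace each stage of the pipeline on one gene name.
# The doubled space after "45S" is added here for illustration only.
gene <- "45S  pre-ribosomal RNAs RNA45S"

clean     <- gsub("[[:space:]]+", " ", gene)   # collapse runs of whitespace
tokens    <- strsplit(clean, " ")[[1]]         # split the name into words
has_digit <- grepl("[0-9]", tokens)            # TRUE where a word contains a digit

text_part   <- paste(tokens[!has_digit], collapse = " ")   # words without digits
number_part <- paste(tokens[has_digit],  collapse = " ")   # words with digits

print(tokens)       # "45S" "pre-ribosomal" "RNAs" "RNA45S"
print(text_part)    # "pre-ribosomal RNAs"
print(number_part)  # "45S RNA45S"
```

If some names might start with whitespace, wrapping the input in trimws() before splitting should avoid an empty first token, since strsplit keeps a leading empty string.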