Split mixed strings into columns in R

Time: 2018-06-18 12:37:15

Tags: r string split text-mining

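I am new to text analysis and R coding.

I have 200 gene names that are mixed strings. I want to split them and put the plain words (e.g., cadherins, orphan receptors) in one column, and the numbers (e.g., 2/3) and number+letter tokens (e.g., 7D, 7TM) in another column. I used strsplit to split the strings into words. Any suggestions on how to parse them would be helpful.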

Example:
Genes <- c("7D cadherins", "7TM orphan receptors", "7TM orphan receptors RNA18S", "28S ribosomal RNAs  RNA28S", "45S pre-ribosomal RNAs  RNA45S", "5.8S ribosomal RNAs", "Actin related protein 2/3 complex")

Expected result (2nd and 3rd columns):

7D cadherins                        cadherins                       7D
7TM orphan receptors                orphan receptors                7TM
18S ribosomal RNAs RNA18S           ribosomal RNAs RNA18S           18S RNA18S
28S ribosomal RNAs RNA28S           ribosomal RNAs RNA28S           28S RNA28S
45S pre-ribosomal RNAs RNA45S       pre-ribosomal RNAs              45S RNA45S
5.8S ribosomal RNAs                 ribosomal RNAs                  5.8S
Actin related protein 2/3 complex   Actin related protein complex   2/3

1 answer:

Answer 0 (score: 0)

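Use strsplit to split the names, grep to detect the words with or without digits, and paste to collapse the words back together. Put everything in a function to avoid repetition: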

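# Return, for each string, the words without digits (invert = TRUE)
# or the words containing digits (invert = FALSE).
wordS <- function(x, invert = TRUE) {
  clean <- gsub( '[[:space:]]+', ' ', x )  # collapse runs of whitespace to one space
  split <- strsplit( clean, ' ' )          # one character vector of words per string
  # keep the words that contain a digit (invert = FALSE) or do not (invert = TRUE)
  detec <- lapply( split, function(y) grep('[0-9]', y, invert = invert, value = TRUE) )
  words <- sapply( detec, paste, collapse = ' ' )  # rejoin the kept words
  return( words )
}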
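
The steps inside wordS may be easier to follow when traced on a single string. The following is a minimal sketch, not part of the original answer: the names x, parts, has_digit, plain and coded are illustrative, and grepl is used instead of grep(..., value = TRUE) only so that the logical mask is visible.

```r
# Trace the steps on one gene name (x is an illustrative value).
x <- "28S ribosomal RNAs  RNA28S"

clean <- gsub("[[:space:]]+", " ", x)   # "28S ribosomal RNAs RNA28S"
parts <- strsplit(clean, " ")[[1]]      # "28S" "ribosomal" "RNAs" "RNA28S"

has_digit <- grepl("[0-9]", parts)      # TRUE FALSE FALSE TRUE

plain <- paste(parts[!has_digit], collapse = " ")  # "ribosomal RNAs"
coded <- paste(parts[has_digit],  collapse = " ")  # "28S RNA28S"
```

Any word containing at least one digit counts as a "number" token under this rule, which is why RNA28S ends up next to 28S.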

data.frame(
  Gene = Genes,
  column2 = wordS(Genes),
  column3 = wordS(Genes, invert = FALSE)
)

                               Gene                       column2    column3
1                      7D cadherins                     cadherins         7D
2              7TM orphan receptors              orphan receptors        7TM
3       7TM orphan receptors RNA18S              orphan receptors 7TM RNA18S
4         28S ribosomal RNAs RNA28S                ribosomal RNAs 28S RNA28S
5     45S pre-ribosomal RNAs RNA45S            pre-ribosomal RNAs 45S RNA45S
6               5.8S ribosomal RNAs                ribosomal RNAs       5.8S
7 Actin related protein 2/3 complex Actin related protein complex        2/3
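
If the input may carry leading or trailing spaces, or may be empty, a slightly more defensive variant could be sketched as follows. This is an assumption-laden illustration rather than the answer's method: split_tokens, pick, words and codes are hypothetical names, trimws requires R 3.2.0 or later, and vapply is swapped in for sapply so that the result is always a character vector.

```r
# Hypothetical helper (not from the answer): builds both columns in one call.
split_tokens <- function(x) {
  # collapse runs of whitespace, then drop leading/trailing spaces
  clean <- trimws(gsub("[[:space:]]+", " ", x))
  parts <- strsplit(clean, " ", fixed = TRUE)

  # keep_digits = TRUE keeps words with a digit, FALSE keeps the others
  pick <- function(keep_digits) {
    vapply(parts, function(p) {
      paste(p[grepl("[0-9]", p) == keep_digits], collapse = " ")
    }, character(1))
  }

  data.frame(Gene = clean, words = pick(FALSE), codes = pick(TRUE),
             stringsAsFactors = FALSE)
}

res <- split_tokens(c("  7D cadherins ", "Actin related protein 2/3 complex"))
res
```

Here stringsAsFactors = FALSE keeps the columns as character vectors on R versions before 4.0.0, where data.frame converted strings to factors by default.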