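Find a character in a CSV and split the cell at it, in R

Time: 2016-05-25 22:26:49

Tags: r csv dataframe gsub

I have a .csv file with gene names such as "AT1G45150". However, some entries contain two gene names joined by an underscore, such as "AT3G01311_ATCG00940" in row 135 below. Is there a simple command, perhaps something like gsub, that would not only find the underscore and delete everything from it onward in that cell, but also place the second gene name in the cell immediately below, in the same column, one row down? I also want to keep everything already in that column and simply extend the column length to fit the new entries.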

"133","AT1G45150","AT1G12200","AT2G25370","AT1G19715","AT2G46830","AT1G20870","AT4G12400","AT1G19660"
"134","AT1G47280","AT1G12410","AT2G26920","AT1G19750","AT2G46850","AT1G21400","AT4G15430","AT1G19690"
"135","AT1G47317","AT1G12530","AT2G27270","AT1G20540","AT3G01311_ATCG00940","AT1G21450","AT5G01970","AT1G19750"
"136","AT1G47420","AT1G12550","AT2G28590","AT1G20570","AT3G03470","AT1G21730","AT1G20800","AT1G19780"
"137","AT1G47500","AT1G12740","AT2G28970","AT1G20580","AT3G03980","AT1G21760","AT3G54740","AT1G19790"
"138","AT1G47570","AT1G12750","AT2G29740","AT1G20610","AT3G05040","AT1G22000","AT4G12400","AT1G19970"

so that it becomes

"133","AT1G45150","AT1G12200","AT2G25370","AT1G19715","AT2G46830","AT1G20870","AT4G12400","AT1G19660"
"134","AT1G47280","AT1G12410","AT2G26920","AT1G19750","AT2G46850","AT1G21400","AT4G15430","AT1G19690"
"135","AT1G47317","AT1G12530","AT2G27270","AT1G20540","AT3G01311","AT1G21450","AT5G01970","AT1G19750"
"136","AT1G47420","AT1G12550","AT2G28590","AT1G20570","ATCG00940","AT1G21730","AT1G20800","AT1G19780"
"137","AT1G47500","AT1G12740","AT2G28970","AT1G20580","AT3G03470","AT1G21760","AT3G54740","AT1G19790"
"138","AT1G47570","AT1G12750","AT2G29740","AT1G20610","AT3G03980","AT1G22000","AT4G12400","AT1G19970"

Thanks for your help!

Edit: an attempt at a reproducible example, hopefully this is useful:

> dput(droplevels(genes[133:138,]))
structure(list(g99 = structure(1:6, .Label = c("AT1G45150", "AT1G47280", 
"AT1G47317", "AT1G47420", "AT1G47500", "AT1G47570"), class = "factor"), 
g95 = structure(1:6, .Label = c("AT1G12200", "AT1G12410", 
"AT1G12530", "AT1G12550", "AT1G12740", "AT1G12750"), class = "factor"), 
y99 = structure(1:6, .Label = c("AT2G25370", "AT2G26920", 
"AT2G27270", "AT2G28590", "AT2G28970", "AT2G29740"), class = "factor"), 
y95 = structure(1:6, .Label = c("AT1G19715", "AT1G19750", 
"AT1G20540", "AT1G20570", "AT1G20580", "AT1G20610"), class = "factor"), 
a99 = structure(1:6, .Label = c("AT2G46830", "AT2G46850", 
"AT3G01311_ATCG00940", "AT3G03470", "AT3G03980", "AT3G05040"
), class = "factor"), a95 = structure(1:6, .Label = c("AT1G20870", 
"AT1G21400", "AT1G21450", "AT1G21730", "AT1G21760", "AT1G22000"
), class = "factor"), e99 = structure(c(3L, 4L, 5L, 1L, 2L, 
3L), .Label = c("AT1G20800", "AT3G54740", "AT4G12400", "AT4G15430", 
"AT5G01970"), class = "factor"), e95 = structure(1:6, .Label = c("AT1G19660", 
"AT1G19690", "AT1G19750", "AT1G19780", "AT1G19790", "AT1G19970"
), class = "factor")), .Names = c("g99", "g95", "y99", "y95", 
"a99", "a95", "e99", "e95"), row.names = 133:138, class = "data.frame")

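3 answers:

Answer 0: (score: 2)

I'm assuming the genes are part of a larger data frame that holds more information about each gene. I use tidyr and dplyr. Something like this should work: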

library(dplyr)
library(tidyr)

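df <- 
  df %>% 
  separate(gene, c('first', 'second'), '_', fill = 'right') %>% # Make two columns; 'second' is NA when there is no second gene
  gather(position, gene, first, second) %>%  # Stack both columns back into a single gene column
  filter(!is.na(gene))                       # Drop the rows created by missing second genes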

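I use separate to split the column in two: the first column holds the first gene and the second holds the second gene, if there is one. Then I use gather to stack all the genes on top of each other, and filter to drop the rows that come from missing second genes.

Hope this helps!

Answer 1: (score: 1)

Now that I've seen your data, I have a new answer. I'm a little confused about what exactly you want in the data frame, but here is how to do it for a single vector.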
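For comparison, here is a minimal base-R sketch of the same reshaping idea, without tidyr or dplyr. The data frame `df` and its `gene` and `score` columns are hypothetical stand-ins for the "larger data frame" assumed above. Unlike the gather pipeline, this variant keeps the two genes of a pair on adjacent rows.

```r
# Hypothetical data frame: one `gene` column plus extra per-gene information.
df <- data.frame(
  gene  = c("AT2G46850", "AT3G01311_ATCG00940", "AT3G03470"),
  score = c(1.5, 2.0, 0.7),
  stringsAsFactors = FALSE
)

# Split each entry at the underscore; entries without one stay as a single piece.
parts   <- strsplit(df$gene, "_", fixed = TRUE)
n_parts <- lengths(parts)

# Repeat each row once per gene it contains, then overwrite `gene` with the pieces.
long <- df[rep(seq_len(nrow(df)), times = n_parts), , drop = FALSE]
long$gene <- unlist(parts)
rownames(long) <- NULL
```

The other columns of a split row are simply duplicated, which may or may not be what you want for your real data.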

library(stringr)

> df$a99
[1] "AT2G46830"           "AT2G46850"           "AT3G01311_ATCG00940"
[4] "AT3G03470"           "AT3G03980"           "AT3G05040"          

> unlist(str_split(df$a99, '_'))
[1] "AT2G46830" "AT2G46850" "AT3G01311" "ATCG00940" "AT3G03470" "AT3G03980"
[7] "AT3G05040"
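If the goal is what the question describes (every column keeps its contents and just grows by one row per split), one possible sketch is to apply the same split to each column and pad the shorter columns with NA. This uses base R's strsplit rather than str_split, and the two-column `genes` data frame below is only a small stand-in for the real data.

```r
# Small stand-in for the question's data frame (factor columns, as in the dput output).
genes <- data.frame(
  a99 = c("AT2G46830", "AT2G46850", "AT3G01311_ATCG00940", "AT3G03470"),
  a95 = c("AT1G20870", "AT1G21400", "AT1G21450", "AT1G21730"),
  stringsAsFactors = TRUE
)

# Split every column at "_" and flatten, so a second gene lands directly below the first.
split_cols <- lapply(genes, function(x) {
  unlist(strsplit(as.character(x), "_", fixed = TRUE))
})

# The columns now differ in length; pad the shorter ones with NA at the bottom.
n_max  <- max(lengths(split_cols))
padded <- lapply(split_cols, function(x) c(x, rep(NA_character_, n_max - length(x))))

genes_long <- as.data.frame(padded, stringsAsFactors = FALSE)
```

Note that after this step the rows no longer line up across columns, so it only makes sense if each column is an independent gene list.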

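Answer 2: (score: 0)

This answer assumes that you want to keep the data frame structure.

First load the following three packages: library(stringr); library(purrr); library(dplyr)

Your data frame then looks like this: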

> genes
   V1        V2        V3        V4        V5                  V6        V7        V8        V9
1 133 AT1G45150 AT1G12200 AT2G25370 AT1G19715           AT2G46830 AT1G20870 AT4G12400 AT1G19660
2 134 AT1G47280 AT1G12410 AT2G26920 AT1G19750           AT2G46850 AT1G21400 AT4G15430 AT1G19690
3 135 AT1G47317 AT1G12530 AT2G27270 AT1G20540 AT3G01311_ATCG00940 AT1G21450 AT5G01970 AT1G19750
4 136 AT1G47420 AT1G12550 AT2G28590 AT1G20570           AT3G03470 AT1G21730 AT1G20800 AT1G19780
5 137 AT1G47500 AT1G12740 AT2G28970 AT1G20580           AT3G03980 AT1G21760 AT3G54740 AT1G19790
6 138 AT1G47570 AT1G12750 AT2G29740 AT1G20610           AT3G05040 AT1G22000 AT4G12400 AT1G19970

If I were only going after the V6 variable, I would use the following command from stringr:

> str_sub(genes$V6, start = 1L, 
          end = ifelse(is.na(str_locate(genes$V6, '_')[,1]), -1,    
          str_locate(genes$V6, '_')[, 1] - 1))
[1] "AT2G46830" "AT2G46850" "AT3G01311" "AT3G03470" "AT3G03980" "AT3G05040"
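Since the question mentions gsub, the same truncation can also be sketched in base R with sub and a regular expression. The vector `v6` below is a hypothetical copy of part of the V6 column; the second expression shows one way to recover the gene after the underscore, which the str_sub call above discards.

```r
# Hypothetical copy of part of the V6 column.
v6 <- c("AT2G46830", "AT2G46850", "AT3G01311_ATCG00940", "AT3G03470")

# Delete the first underscore and everything after it; other entries are unchanged.
first_gene <- sub("_.*$", "", v6)

# Keep only what follows the first underscore, or NA when there is no underscore.
second_gene <- ifelse(grepl("_", v6, fixed = TRUE),
                      sub("^[^_]*_", "", v6),
                      NA_character_)
```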

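But we want to generalize this to all variables, in case you want to keep the data frame structure. So use the map function from purrr to iterate over every column of the data frame (you could use lapply in a similar way, but it is sometimes hard to coerce the result back into a data frame):

> genes2 <- map(genes, function(x) {
    str_sub(x, start = 1L,
            end = ifelse(is.na(str_locate(x, '_')[, 1]), -1,
                         str_locate(x, '_')[, 1] - 1))
  }) %>% as_data_frame()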

> genes2
Source: local data frame [6 x 9]

     V1        V2        V3        V4        V5        V6        V7        V8        V9
  (chr)     (chr)     (chr)     (chr)     (chr)     (chr)     (chr)     (chr)     (chr)
1   133 AT1G45150 AT1G12200 AT2G25370 AT1G19715 AT2G46830 AT1G20870 AT4G12400 AT1G19660
2   134 AT1G47280 AT1G12410 AT2G26920 AT1G19750 AT2G46850 AT1G21400 AT4G15430 AT1G19690
3   135 AT1G47317 AT1G12530 AT2G27270 AT1G20540 AT3G01311 AT1G21450 AT5G01970 AT1G19750
4   136 AT1G47420 AT1G12550 AT2G28590 AT1G20570 AT3G03470 AT1G21730 AT1G20800 AT1G19780
5   137 AT1G47500 AT1G12740 AT2G28970 AT1G20580 AT3G03980 AT1G21760 AT3G54740 AT1G19790
6   138 AT1G47570 AT1G12750 AT2G29740 AT1G20610 AT3G05040 AT1G22000 AT4G12400 AT1G19970
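As an aside on the lapply route mentioned above: one common base-R idiom that avoids the coercion problem is to assign into `genes2[]`, which replaces the columns while keeping the data frame wrapper. This is a sketch on a hypothetical two-row stand-in for `genes`, using sub instead of str_sub; as in the output above, every column ends up as character.

```r
# Hypothetical two-row stand-in for the `genes` data frame.
genes <- data.frame(
  V1 = c(135L, 136L),
  V6 = c("AT3G01311_ATCG00940", "AT3G03470"),
  stringsAsFactors = FALSE
)

genes2 <- genes
# Assigning into genes2[] keeps the data.frame class and the column names.
genes2[] <- lapply(genes, function(x) sub("_.*$", "", as.character(x)))
```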
