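I have a .csv file of gene names such as "AT1G45150". However, some entries contain two gene names joined by an underscore, so they look like "AT3G01311_ATCG00940", as shown in row 135. Is there a simple command, perhaps something like gsub, that not only finds and removes everything from the underscore onward in a cell, but also places the second gene name in the cell directly below it, in the same column, one row down? I would also like to keep everything that is already in that column and simply extend the column length to make room for the new entries.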
"133","AT1G45150","AT1G12200","AT2G25370","AT1G19715","AT2G46830","AT1G20870","AT4G12400","AT1G19660"
"134","AT1G47280","AT1G12410","AT2G26920","AT1G19750","AT2G46850","AT1G21400","AT4G15430","AT1G19690"
"135","AT1G47317","AT1G12530","AT2G27270","AT1G20540","AT3G01311_ATCG00940","AT1G21450","AT5G01970","AT1G19750"
"136","AT1G47420","AT1G12550","AT2G28590","AT1G20570","AT3G03470","AT1G21730","AT1G20800","AT1G19780"
"137","AT1G47500","AT1G12740","AT2G28970","AT1G20580","AT3G03980","AT1G21760","AT3G54740","AT1G19790"
"138","AT1G47570","AT1G12750","AT2G29740","AT1G20610","AT3G05040","AT1G22000","AT4G12400","AT1G19970"
so that it becomes:
"133","AT1G45150","AT1G12200","AT2G25370","AT1G19715","AT2G46830","AT1G20870","AT4G12400","AT1G19660"
"134","AT1G47280","AT1G12410","AT2G26920","AT1G19750","AT2G46850","AT1G21400","AT4G15430","AT1G19690"
"135","AT1G47317","AT1G12530","AT2G27270","AT1G20540","AT3G01311","AT1G21450","AT5G01970","AT1G19750"
"136","AT1G47420","AT1G12550","AT2G28590","AT1G20570","ATCG00940","AT1G21730","AT1G20800","AT1G19780"
"137","AT1G47500","AT1G12740","AT2G28970","AT1G20580","AT3G03470","AT1G21760","AT3G54740","AT1G19790"
"138","AT1G47570","AT1G12750","AT2G29740","AT1G20610","AT3G03980","AT1G22000","AT4G12400","AT1G19970"
Thanks for your help!
Edit: here is an attempt at a reproducible example; I hope it is useful:
> dput(droplevels(genes[133:138,]))
structure(list(g99 = structure(1:6, .Label = c("AT1G45150", "AT1G47280",
"AT1G47317", "AT1G47420", "AT1G47500", "AT1G47570"), class = "factor"),
g95 = structure(1:6, .Label = c("AT1G12200", "AT1G12410",
"AT1G12530", "AT1G12550", "AT1G12740", "AT1G12750"), class = "factor"),
y99 = structure(1:6, .Label = c("AT2G25370", "AT2G26920",
"AT2G27270", "AT2G28590", "AT2G28970", "AT2G29740"), class = "factor"),
y95 = structure(1:6, .Label = c("AT1G19715", "AT1G19750",
"AT1G20540", "AT1G20570", "AT1G20580", "AT1G20610"), class = "factor"),
a99 = structure(1:6, .Label = c("AT2G46830", "AT2G46850",
"AT3G01311_ATCG00940", "AT3G03470", "AT3G03980", "AT3G05040"
), class = "factor"), a95 = structure(1:6, .Label = c("AT1G20870",
"AT1G21400", "AT1G21450", "AT1G21730", "AT1G21760", "AT1G22000"
), class = "factor"), e99 = structure(c(3L, 4L, 5L, 1L, 2L,
3L), .Label = c("AT1G20800", "AT3G54740", "AT4G12400", "AT4G15430",
"AT5G01970"), class = "factor"), e95 = structure(1:6, .Label = c("AT1G19660",
"AT1G19690", "AT1G19750", "AT1G19780", "AT1G19790", "AT1G19970"
), class = "factor")), .Names = c("g99", "g95", "y99", "y95",
"a99", "a95", "e99", "e95"), row.names = 133:138, class = "data.frame")
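Answer 0 (score: 2)
I'm assuming these genes are part of a larger data frame that holds more information about each gene. I'm using tidyr and dplyr. Something like this should work: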
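library(dplyr)
library(tidyr)
df <-
  df %>%
  # Make two columns; 'second' is NA when a cell has no underscore
  separate(gene, c('first', 'second'), sep = '_', fill = 'right') %>%
  # Stack both columns back into a single 'gene' column
  gather(position, gene, first, second) %>%
  # Drop the rows created for missing second genes
  filter(!is.na(gene))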
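I use separate to split the column in two: the first column holds the first gene and the second column holds the second gene (if there is one). Then I use gather to stack all the genes on top of one another, and filter to drop the rows that came from a missing second gene.
Hope this helps!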
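For readers who would rather not depend on tidyr, the same split-then-stack idea can be sketched in base R. This is only an illustration under assumptions: the data frame `df`, its `gene` column, and the extra `score` column are hypothetical stand-ins for the "larger data frame" the answer has in mind. Unlike the gather version, the second gene ends up directly below the first one.

```r
# Hypothetical data: one gene per row plus an extra per-gene column.
df <- data.frame(
  gene  = c("AT2G46850", "AT3G01311_ATCG00940", "AT3G03470"),
  score = c(1.5, 2.0, 0.7),
  stringsAsFactors = FALSE
)

# Split every cell on the underscore; cells without one yield a single piece.
parts <- strsplit(df$gene, "_", fixed = TRUE)

# Repeat each row once per piece, then overwrite 'gene' with the pieces.
long <- df[rep(seq_len(nrow(df)), lengths(parts)), , drop = FALSE]
long$gene <- unlist(parts)
rownames(long) <- NULL
```

The other columns (here `score`) are copied for both genes of a joined entry, which may or may not be what you want for your data.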
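Answer 1 (score: 1)
Now that I have seen your data, I have a new answer. I'm a little confused about exactly what you want in the data frame, but here is how to do it for a single vector.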
library(stringr)
> df$a99
[1] "AT2G46830" "AT2G46850" "AT3G01311_ATCG00940"
[4] "AT3G03470" "AT3G03980" "AT3G05040"
> unlist(str_split(df$a99, '_'))
[1] "AT2G46830" "AT2G46850" "AT3G01311" "ATCG00940" "AT3G03470" "AT3G03980"
[7] "AT3G05040"
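If the goal is the layout described in the question (each column keeps its own entries and simply grows by one row per extra gene), the single-vector split can be applied to every column and the shorter columns padded with NA. The following is a base R sketch under that assumption; the helper name `split_column` and the two-column toy data are made up for illustration.

```r
# Split one column on underscores and flatten the pieces into one vector.
# as.character() is needed because the columns in the dput() are factors.
split_column <- function(x) {
  unlist(strsplit(as.character(x), "_", fixed = TRUE))
}

# Toy subset of the question's data (column names taken from the dput output).
genes <- data.frame(
  a99 = c("AT2G46830", "AT2G46850", "AT3G01311_ATCG00940", "AT3G03470"),
  a95 = c("AT1G20870", "AT1G21400", "AT1G21450", "AT1G21730"),
  stringsAsFactors = TRUE
)

cols <- lapply(genes, split_column)

# Columns now differ in length, so pad the shorter ones with NA.
n <- max(lengths(cols))
padded <- lapply(cols, function(v) c(v, rep(NA_character_, n - length(v))))

result <- as.data.frame(padded, stringsAsFactors = FALSE)
```

Note that after this step the rows no longer line up across columns, so the result is best read as a set of independent gene lists rather than as row-wise records.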
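Answer 2 (score: 0)
This answer assumes you may want to preserve the data frame structure.
First load the following three packages:
library(stringr); library(purrr); library(dplyr)
Your data frame then looks like this: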
> genes
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 133 AT1G45150 AT1G12200 AT2G25370 AT1G19715 AT2G46830 AT1G20870 AT4G12400 AT1G19660
2 134 AT1G47280 AT1G12410 AT2G26920 AT1G19750 AT2G46850 AT1G21400 AT4G15430 AT1G19690
3 135 AT1G47317 AT1G12530 AT2G27270 AT1G20540 AT3G01311_ATCG00940 AT1G21450 AT5G01970 AT1G19750
4 136 AT1G47420 AT1G12550 AT2G28590 AT1G20570 AT3G03470 AT1G21730 AT1G20800 AT1G19780
5 137 AT1G47500 AT1G12740 AT2G28970 AT1G20580 AT3G03980 AT1G21760 AT3G54740 AT1G19790
6 138 AT1G47570 AT1G12750 AT2G29740 AT1G20610 AT3G05040 AT1G22000 AT4G12400 AT1G19970
If I were only going after the V6 variable, I would use the following command from stringr:
> str_sub(genes$V6, start = 1L,
end = ifelse(is.na(str_locate(genes$V6, '_')[,1]), -1,
str_locate(genes$V6, '_')[, 1] - 1))
[1] "AT2G46830" "AT2G46850" "AT3G01311" "AT3G03470" "AT3G03980" "AT3G05040"
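Since the question mentions gsub, it may help to see that the same truncation can be written with base R's sub and a regular expression. This is a sketch on a made-up three-element vector `v6`; as with the str_sub call, the first result keeps only the first gene, so the second expression shows one way to recover the part after the underscore.

```r
v6 <- c("AT2G46830", "AT3G01311_ATCG00940", "AT3G03470")

# Remove the first underscore and everything after it.
first_gene <- sub("_.*$", "", v6)

# Keep what follows the first underscore, or NA when there is none.
second_gene <- ifelse(grepl("_", v6, fixed = TRUE),
                      sub("^[^_]*_", "", v6),
                      NA_character_)
```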
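But we want to generalize this to every variable, in case you want to keep the data frame structure. So use the map function from purrr to iterate over all the columns in the data frame (you could also use lapply in a similar way, but it is sometimes hard to coerce the result back into a data frame):
> genes2 <- map(genes, function(x) { str_sub(x, start = 1L,
      end = ifelse(is.na(str_locate(x, '_')[,1]), -1,
                   str_locate(x, '_')[,1] - 1)) }) %>%
      as_data_frame()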
> genes2
Source: local data frame [6 x 9]
V1 V2 V3 V4 V5 V6 V7 V8 V9
(chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
1 133 AT1G45150 AT1G12200 AT2G25370 AT1G19715 AT2G46830 AT1G20870 AT4G12400 AT1G19660
2 134 AT1G47280 AT1G12410 AT2G26920 AT1G19750 AT2G46850 AT1G21400 AT4G15430 AT1G19690
3 135 AT1G47317 AT1G12530 AT2G27270 AT1G20540 AT3G01311 AT1G21450 AT5G01970 AT1G19750
4 136 AT1G47420 AT1G12550 AT2G28590 AT1G20570 AT3G03470 AT1G21730 AT1G20800 AT1G19780
5 137 AT1G47500 AT1G12740 AT2G28970 AT1G20580 AT3G03980 AT1G21760 AT3G54740 AT1G19790
6 138 AT1G47570 AT1G12750 AT2G29740 AT1G20610 AT3G05040 AT1G22000 AT4G12400 AT1G19970
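The lapply route mentioned above can keep the data frame structure if the result is assigned back with `genes2[] <-`, which replaces the columns in place instead of returning a bare list. This is a base R sketch on a made-up two-column subset, and like the map version it keeps only the first gene of each joined entry.

```r
# Made-up subset of the data shown above.
genes <- data.frame(
  V1 = 133:135,
  V6 = c("AT2G46830", "AT2G46850", "AT3G01311_ATCG00940"),
  stringsAsFactors = FALSE
)

genes2 <- genes
# Assigning into genes2[] keeps the data.frame class, names and row names.
genes2[] <- lapply(genes2, function(x) sub("_.*$", "", as.character(x)))
```

Every column comes back as character, including V1, which matches the (chr) columns in the printed output above.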