I have a column (geneDesc) in a data frame (bacteria) that I want to split into two columns. The column contains the gene ID and the species name of the organism the gene comes from in brackets.
For example:
geneDesc
hypothetical protein, partial [Vibrio shilonii]
ankyrin repeat protein [Leptospira kirschneri]
helicase [Alteromonas macleodii]
I'm using the following command:
bacteria2 <- separate(bacteria, geneDesc, c("gene", "species"), sep = "\\[")
But I get this error:
Error: Values not split into 2 pieces at 341, 342, 448, 450, etc...
Is there a way to run the command anyway and just create another column where there is another "["? Everything after the first bracket is of no interest.
Answer 0 (score: 1)
You almost have it, but your sep regular expression needs to be adjusted to match either [ or ]:
library(tidyr)
bacteria %>% separate(geneDesc,c("gene","species"), sep="[\\[\\]]", extra="drop")
Output:
gene species
1 hypothetical protein, partial Vibrio shilonii
2 ankyrin repeat protein Leptospira kirschneri
3 helicase Alteromonas macleodii
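As a minimal sketch of why this helps with the rows that raised the error, the example below uses made-up sample data (the second row and its "[partial]" suffix are assumptions, not taken from the original data frame). Splitting on either bracket produces more than two pieces for such a row, and extra = "drop" discards everything past the second piece. The gene column keeps the space that preceded the bracket, so trimws() is applied afterwards.

```r
library(tidyr)

# Hypothetical sample data: the second row contains a second "[",
# similar to the rows that could not be split into exactly 2 pieces.
bacteria <- data.frame(
  geneDesc = c("helicase [Alteromonas macleodii]",
               "transporter [Vibrio shilonii] [partial]"),
  stringsAsFactors = FALSE
)

# Split on "[" or "]" and keep only the first two pieces.
res <- separate(bacteria, geneDesc, c("gene", "species"),
                sep = "[\\[\\]]", extra = "drop")

# Remove the space left in front of the opening bracket.
res$gene <- trimws(res$gene)

print(res)
```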
Answer 1 (score: 0)
separate(..., extra = "drop")
or
separate(..., extra = "merge")
Another option is:
library(stringr)
library(dplyr)
bacteria %>%
mutate(gene = geneDesc %>% str_replace_all(" *\\[.*$", "") )
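To illustrate the difference between the two extra settings mentioned at the top of this answer, here is a small sketch on made-up data (the sample rows and the " \\[" separator, which also consumes the space before the bracket, are assumptions for illustration). With extra = "drop" the surplus pieces are discarded; with extra = "merge" the string is split at most once per target column, so the remainder stays in the last column.

```r
library(tidyr)

# Hypothetical sample data; the second row has two bracketed parts.
bacteria <- data.frame(
  geneDesc = c("helicase [Alteromonas macleodii]",
               "transporter [Vibrio shilonii] [partial]"),
  stringsAsFactors = FALSE
)

# "drop": split on every " [" and throw away pieces beyond the second.
dropped <- separate(bacteria, geneDesc, c("gene", "species"),
                    sep = " \\[", extra = "drop")

# "merge": split only once, leaving the rest of the string in species.
merged <- separate(bacteria, geneDesc, c("gene", "species"),
                   sep = " \\[", extra = "merge")

print(dropped)
print(merged)
```

Note that with this separator the closing "]" remains in the species column in both cases. Since everything after the first bracket is of no interest here, that may not matter.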
Answer 2 (score: 0)
If you only want to remove everything after the first bracket, I suggest gsub:
> df <- read.table(text='hypothetical protein, partial [Vibrio shilonii]
+ ankyrin repeat protein [Leptospira kirschneri]
+ helicase [Alteromonas macleodii]', sep='\n')
> df
V1
1 hypothetical protein, partial [Vibrio shilonii]
2 ankyrin repeat protein [Leptospira kirschneri]
3 helicase [Alteromonas macleodii]
> gsub('\\s+\\[.*$', '', df$V1)
[1] "hypothetical protein, partial" "ankyrin repeat protein" "helicase"
> data.frame(data=gsub('\\s+\\[.*$', '', df$V1))
data
1 hypothetical protein, partial
2 ankyrin repeat protein
3 helicase
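If the species name is wanted as well, the same base R approach could be extended with a capture group. This is only a sketch on made-up input (the second element and its "[extra]" suffix are assumptions), and it assumes every value contains at least one bracketed part. A value without brackets would be returned unchanged in species.

```r
# Hypothetical input vector; the second element has a second bracketed part.
geneDesc <- c("hypothetical protein, partial [Vibrio shilonii]",
              "helicase [Alteromonas macleodii] [extra]")

# Gene: drop everything from the first "[" (and the whitespace before it).
gene <- sub("\\s*\\[.*$", "", geneDesc)

# Species: capture the text inside the first pair of brackets.
species <- sub("^[^\\[]*\\[([^\\]]*)\\].*$", "\\1", geneDesc, perl = TRUE)

out <- data.frame(gene, species, stringsAsFactors = FALSE)
print(out)
```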