My file has the following format:
Gene | GO
Bcin01g00010| GO:0016491 (MF: oxidoreductase activity),GO:0003824 (MF: catalytic activity),GO:0050662^MF^coenzyme binding
I would like to clean it up so that it looks like this:
Bcin01g00010|GO:0016491 (MF: oxidoreductase activity)
Bcin01g00010|GO:0003824 (MF: catalytic activity)
Bcin01g00010|GO:0050662^MF^coenzyme binding
Any help would be appreciated.
Answer 0 (score: 0)
If there are only a few lines, you can do this "by hand":
# Replace each comma with a line break ("\n") followed by the gene name "Bcin01g00010|"
# Replace the last string with the text from your own file
# The inner sub() drops the space after the first "|" so the first line matches the others
Goterm <- gsub(",", "\nBcin01g00010|", sub("| ", "|", "Bcin01g00010| GO:0016491 (MF: oxidoreductase activity),GO:0003824 (MF: catalytic activity),GO:0050662^MF^coenzyme binding", fixed = TRUE))
# cat() prints each "\n" as an actual line break
cat(Goterm)
It does exactly what you want, but if you have a lot of lines, you can do it in a loop.
However, when you annotate your file with GO terms, you can generally ask the program to arrange the output the way you want directly (with Blast2GO, for example).
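For the many-lines case, the loop suggested above might look like the sketch below. It is only an illustration: the `lines` vector stands in for your file, the second gene and its term are made-up placeholders, and it assumes that no GO description contains a comma of its own.

```r
# Hypothetical input: one "gene| term,term,..." string per line.
# The second entry is a made-up placeholder, not real annotation data.
lines <- c(
  "Bcin01g00010| GO:0016491 (MF: oxidoreductase activity),GO:0003824 (MF: catalytic activity)",
  "PlaceholderGene| GO:0000000 (MF: placeholder term)"
)

cleaned <- character(0)
for (line in lines) {
  # Everything before the first "|" is the gene name
  gene <- trimws(sub("\\|.*$", "", line))
  # Everything after the first "|" is the comma-separated GO list
  terms <- trimws(sub("^[^|]*\\|", "", line))
  # Same comma substitution as in the single-line example,
  # but with the gene name taken from the current line
  expanded <- gsub(",", paste0("\n", gene, "|"), paste0(gene, "|", terms), fixed = TRUE)
  cleaned <- c(cleaned, expanded)
}

# Print one "gene|term" pair per line
cat(cleaned, sep = "\n")
```

Growing `cleaned` with `c()` inside the loop is fine for small files; for very large ones a vectorised approach would likely be faster.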
Answer 1 (score: 0)
You can split the text however you like:
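# `text` is assumed to hold the header line and the data line from the question
text <- "Gene | GO\nBcin01g00010| GO:0016491 (MF: oxidoreductase activity),GO:0003824 (MF: catalytic activity),GO:0050662^MF^coenzyme binding"
# Split on commas, vertical bars, and line breaks
p1 <- strsplit(text, "[,|\n]")
# Trim surrounding whitespace and drop empty pieces
p2 <- lapply(p1, trimws)
p3 <- p2[[1]][p2[[1]] != ""]
col_names <- p3[1:2]   # "Gene" and "GO"
gene_type <- p3[3]     # the gene name
values <- p3[-(1:3)]   # everything after the gene name is a GO term
gene.table <- data.frame(Gene = rep(gene_type, length(values)), GO = values)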
You can wrap this in a function and then use the apply family of functions if you have multiple lines.
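As a rough sketch of that idea, the splitting step can be wrapped in a helper and mapped over the lines with lapply. The helper name `split_go_line`, the `lines` vector, and the second (placeholder) gene are assumptions for illustration, and the header line is assumed to have been removed already.

```r
# Hypothetical helper: turn one "gene| term,term,..." line into a data frame
split_go_line <- function(line) {
  # Split on commas and vertical bars, then trim and drop empty pieces
  parts <- trimws(strsplit(line, "[,|]")[[1]])
  parts <- parts[parts != ""]
  # First piece is the gene; repeat it once per remaining GO term
  data.frame(
    Gene = rep(parts[1], length(parts) - 1),
    GO = parts[-1],
    stringsAsFactors = FALSE
  )
}

# Hypothetical input; the second entry is a made-up placeholder
lines <- c(
  "Bcin01g00010| GO:0016491 (MF: oxidoreductase activity),GO:0003824 (MF: catalytic activity)",
  "PlaceholderGene| GO:0000000 (MF: placeholder term)"
)

# One data frame per line, stacked into a single table
gene_table <- do.call(rbind, lapply(lines, split_go_line))
gene_table
```

Like the original, this splits on every comma, so a GO description that contains a comma would be cut in two.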
Answer 2 (score: 0)
It's hard to answer without more info on your data and your desired output, but this is flexible and should get you started. It assumes you have multiple genes, that each gene is always separated from the GO column by |, and that the entries in the GO column are separated by commas. Good luck!
library(stringi)
text <- "Bcin01g00010| GO:0016491 (MF: oxidoreductase activity),GO:0003824 (MF: catalytic activity),GO:0050662^MF^coenzyme binding"
# Collapsing runs of whitespace into a single space
normal_spaces <- stri_replace_all_regex(text, replacement = " ", pattern = "\\s+")
# Separating the gene name from the GO terms
split_on_vbar <- stri_split_regex(normal_spaces, "\\|", simplify = TRUE)
# Trimming leading and trailing whitespace
split_on_vbar <- trimws(split_on_vbar)
# Pasting the gene name onto each GO term; the output is a vector
formatted_vector <- paste(split_on_vbar[,1], stri_split_fixed(split_on_vbar[,2], ",", simplify = TRUE), sep = "|")
formatted_vector
[1] "Bcin01g00010|GO:0016491 (MF: oxidoreductase activity)" "Bcin01g00010|GO:0003824 (MF: catalytic activity)"
[3] "Bcin01g00010|GO:0050662^MF^coenzyme binding"
# Assuming you want a dataframe...
df1 <- as.data.frame(stri_split_fixed(formatted_vector, "|", simplify = TRUE))
names(df1) <- c("Gene", "GO")
df1
          Gene                                       GO
1 Bcin01g00010 GO:0016491 (MF: oxidoreductase activity)
2 Bcin01g00010      GO:0003824 (MF: catalytic activity)
3 Bcin01g00010          GO:0050662^MF^coenzyme binding
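If the goal is to rewrite the file itself, a base R round trip along these lines might work. This is a sketch, not part of the answer above: the file paths are temporary stand-ins for your real files, and it assumes the first line is the "Gene | GO" header and that GO descriptions contain no commas.

```r
# Hypothetical file paths; temporary files stand in for the real input and output
in_file  <- tempfile(fileext = ".txt")
out_file <- tempfile(fileext = ".txt")
writeLines(c(
  "Gene | GO",
  "Bcin01g00010| GO:0016491 (MF: oxidoreductase activity),GO:0003824 (MF: catalytic activity),GO:0050662^MF^coenzyme binding"
), in_file)

raw <- readLines(in_file)
raw <- raw[-1]                   # drop the "Gene | GO" header line
raw <- raw[nzchar(trimws(raw))]  # drop blank lines

# Gene name is the text before the first "|"; GO terms are the text after it
genes <- trimws(sub("\\|.*$", "", raw))
terms <- strsplit(trimws(sub("^[^|]*\\|", "", raw)), ",", fixed = TRUE)

# Repeat each gene once per GO term, then paste the pairs together
out <- paste(rep(genes, lengths(terms)), trimws(unlist(terms)), sep = "|")
writeLines(out, out_file)
readLines(out_file)
```

Using `rep()` with `lengths()` keeps each gene aligned with its own terms even when genes have different numbers of GO entries.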