Question

My file has the following format:

Gene | GO

Bcin01g00010|               GO:0016491 (MF: oxidoreductase  activity),GO:0003824 (MF: catalytic activity),GO:0050662^MF^coenzyme   binding

I would like to clean it up so that it looks like this:

Bcin01g00010|GO:0016491 (MF: oxidoreductase activity)

Bcin01g00010|GO:0003824 (MF: catalytic activity)

Bcin01g00010|GO:0050662^MF^coenzyme binding

Any help would be appreciated.

Answer 1

If there are only a few lines, you can do this "by hand":

# Replace each comma with a newline ("\n") followed by the gene name "Bcin01g00010|"
# Swap the last argument for the contents of your own file
Goterm <- gsub(",", "\nBcin01g00010|", "Bcin01g00010| GO:0016491 (MF: oxidoreductase activity),GO:0003824 (MF: catalytic activity),GO:0050662^MF^coenzyme binding")
# cat() prints each "\n" as a real line break
cat(Goterm)

It does exactly what you want, but if you have a lot of lines, you can do it in a loop.
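As a rough sketch of such a loop in base R: the `lines` vector below is a stand-in for your file (in practice it might come from `readLines()`, with the `Gene | GO` header line skipped), and the second gene is made up purely for illustration. It assumes every line contains a `|` and that commas only appear between GO terms.

```r
# Hypothetical input: one "Gene|GO,GO,..." record per element.
# The second record is invented for illustration only.
lines <- c(
  "Bcin01g00010|               GO:0016491 (MF: oxidoreductase  activity),GO:0003824 (MF: catalytic activity),GO:0050662^MF^coenzyme   binding",
  "Bcin01g00020| GO:0005515 (MF: protein binding)"
)

out <- character(0)
for (line in lines) {
  # Collapse runs of whitespace into a single space
  line <- gsub("\\s+", " ", line)
  # Locate the first "|" and cut the line into gene name and GO list
  bar <- regexpr("|", line, fixed = TRUE)
  gene <- trimws(substr(line, 1, bar - 1))
  go_list <- substr(line, bar + 1, nchar(line))
  # Split the GO list on commas and trim each term
  terms <- trimws(strsplit(go_list, ",", fixed = TRUE)[[1]])
  # One "gene|term" string per GO term
  out <- c(out, paste(gene, terms, sep = "|"))
}

cat(out, sep = "\n")
```

Growing `out` with `c()` inside the loop is fine for a small file; for a very large one, collecting the pieces in a list and calling `unlist()` once at the end would probably be faster.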

However, the program you used to annotate your file with GO terms (Blast2GO, for example) can generally export the file directly in the layout you want.

Answer 2

You can split the text to your liking:

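# `text` is assumed to hold the whole file as one string, header line included
p1 <- strsplit(text, "[,|\n]")       # split on commas, vertical bars and newlines
p2 <- lapply(p1, trimws)             # trim whitespace around each piece
p3 <- p2[[1]][which(p2[[1]] != "")]  # drop empty pieces
col_names <- p3[1:2]                 # the header: "Gene", "GO"
gene_type <- p3[3]                   # the gene name
values <- p3[4:6]                    # the three GO terms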

gene.table <- data.frame(Gene = rep(gene_type, length(values)), GO = values)

If you have multiple lines, you can wrap this in a function and then use the apply family of functions.
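A minimal sketch of that idea follows. The helper name `split_go_line` and the second input line are hypothetical, and the code assumes each line has at least one GO term and that neither `,` nor `|` occurs inside a GO description.

```r
# Hypothetical helper: turn one "Gene|GO,GO,..." line into a data frame
split_go_line <- function(line) {
  # Collapse repeated whitespace, then split on commas and vertical bars
  line <- gsub("\\s+", " ", line)
  parts <- trimws(strsplit(line, "[,|]")[[1]])
  parts <- parts[parts != ""]
  # First piece is the gene, the rest are GO terms (the gene is recycled)
  data.frame(Gene = parts[1], GO = parts[-1], stringsAsFactors = FALSE)
}

# Made-up input; the second record is for illustration only
lines <- c(
  "Bcin01g00010| GO:0016491 (MF: oxidoreductase activity),GO:0003824 (MF: catalytic activity),GO:0050662^MF^coenzyme binding",
  "Bcin01g00020| GO:0005515 (MF: protein binding)"
)

# Apply the helper to every line and stack the results
gene.table <- do.call(rbind, lapply(lines, split_go_line))
```

Here `lapply()` returns one small data frame per line, and `do.call(rbind, ...)` binds them into a single table with one row per gene/GO pair.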

Answer 3

It's hard to answer without more information about your data and your desired output, but this is flexible and should get you started. It assumes you have multiple genes, that a | always separates the gene from the GO column, and that the GO terms are separated by commas. Good luck!

library(stringi)

text <- "Bcin01g00010|               GO:0016491 (MF: oxidoreductase  activity),GO:0003824 (MF: catalytic activity),GO:0050662^MF^coenzyme   binding"

# Collapsing runs of whitespace into a single space
normal_spaces <- stri_replace_all_regex(text, replacement = " ", pattern = "\\s+")

# Separating the gene name from the GO terms
split_on_vbar <- stri_split_regex(normal_spaces, "\\|", simplify = TRUE)

# Trimming leading and trailing whitespace from each piece
split_on_vbar <- trimws(split_on_vbar)

# Pasting the gene name onto each GO term; the output is a vector
formatted_vector <- paste(split_on_vbar[,1], stri_split_fixed(split_on_vbar[,2], ",", simplify = TRUE), sep = "|")
formatted_vector
[1] "Bcin01g00010|GO:0016491 (MF: oxidoreductase activity)" "Bcin01g00010|GO:0003824 (MF: catalytic activity)"     
[3] "Bcin01g00010|GO:0050662^MF^coenzyme binding" 


# Assuming you want a dataframe...
df1 <- as.data.frame(stri_split_fixed(formatted_vector, "|", simplify = TRUE))

names(df1) <- c("Gene", "GO")

df1
          Gene                                       GO
1 Bcin01g00010 GO:0016491 (MF: oxidoreductase activity)
2 Bcin01g00010      GO:0003824 (MF: catalytic activity)
3 Bcin01g00010           GO:0050662^MF^coenzyme binding
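Since the example above handles a single gene, here is one way the same stringi steps might be extended to several lines at once. The `lines` vector and its second record are made up for illustration, and the sketch assumes every line contains a `|` followed by at least one GO term.

```r
library(stringi)

# Made-up input; the second record is for illustration only
lines <- c(
  "Bcin01g00010|               GO:0016491 (MF: oxidoreductase  activity),GO:0003824 (MF: catalytic activity),GO:0050662^MF^coenzyme   binding",
  "Bcin01g00020| GO:0005515 (MF: protein binding)"
)

# Collapse runs of whitespace into a single space
clean <- stri_replace_all_regex(lines, "\\s+", " ")

# n = 2 splits on the first "|" only, giving a two-column matrix
gene_go <- stri_split_fixed(clean, "|", n = 2, simplify = TRUE)
genes <- stri_trim_both(gene_go[, 1])

# A list with one character vector of GO terms per gene
go_terms <- stri_split_fixed(stri_trim_both(gene_go[, 2]), ",")

# Repeat each gene once per GO term and flatten the list
df_all <- data.frame(
  Gene = rep(genes, lengths(go_terms)),
  GO = stri_trim_both(unlist(go_terms)),
  stringsAsFactors = FALSE
)
```

Because every step is vectorised, no explicit loop is needed; `lengths(go_terms)` tells `rep()` how many rows each gene should occupy.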
