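Apologies if a similar question has been asked before; I could not find one, possibly because of how the question is worded.
Some sample data is shown below, where the first column is a list of identifiers (genes) and the second column is a set of descriptors (Gene Ontology IDs):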
Gene Gene_Ontology_ID
Gene1 GO1, GO2, GO4, GO6
Gene2 GO2, GO3, GO4
Gene3 GO5, GO7
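I would like to know whether there is an efficient way to convert a large table in this format so that the "Gene_Ontology_ID" column becomes the identifier column and the "Gene" column holds the list of genes for each Gene_Ontology_ID, as shown below: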
Gene_Ontology_ID Gene
GO1 Gene1
GO2 Gene1,Gene2
GO3 Gene2
GO4 Gene1,Gene2
GO5 Gene3
GO6 Gene1
GO7 Gene3
Is there a solution, preferably using Unix, Python, or R? Any help is much appreciated, thank you.
Answer 0 (score: 1)
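library(dplyr)
library(tidyr)
# Split the ID string into up to 7 columns (raise 7 if a gene can have more IDs),
# reshape to long format, then collapse the genes for each ID.
out <- df %>%
  separate(Gene_Ontology_ID, into = paste("genes", 1:7, sep = "_"), sep = ", ", fill = "right") %>%
  gather(key, Gene_Ontology_ID, -Gene, na.rm = TRUE) %>%
  arrange(Gene_Ontology_ID, Gene) %>%
  group_by(Gene_Ontology_ID) %>%
  summarise(Gene = paste(Gene, collapse = ", "))
out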
# A tibble: 7 x 2
Gene_Ontology_ID Gene
<chr> <chr>
1 GO1 Gene1
2 GO2 Gene1, Gene2
3 GO3 Gene2
4 GO4 Gene1, Gene2
5 GO5 Gene3
6 GO6 Gene1
7 GO7 Gene3
Answer 1 (score: 0)
A similar option using strsplit and unnest:
library(dplyr)
library(tidyr)
df %>%
  mutate(Gene_Ontology_ID = strsplit(Gene_Ontology_ID, ", ")) %>%
  unnest(Gene_Ontology_ID) %>%
  group_by(Gene_Ontology_ID) %>%
  summarise(Gene = paste(Gene, collapse = ", "))
# A tibble: 7 x 2
Gene_Ontology_ID Gene
<chr> <chr>
1 GO1 Gene1
2 GO2 Gene1, Gene2
3 GO3 Gene2
4 GO4 Gene1, Gene2
5 GO5 Gene3
6 GO6 Gene1
7 GO7 Gene3
Answer 2 (score: 0)
Using cSplit from splitstackshape together with data.table:
library(data.table)
library(splitstackshape)
df <- data.frame(Gene = c("Gene1", "Gene2", "Gene3"), Gene_Ontology_ID = c("GO1, GO2, GO4, GO6", "GO2, GO3, GO4", "GO5, GO7"))
df <- cSplit(df, 'Gene_Ontology_ID', ',', 'long', drop = FALSE)
setDT(df)
df[, Gene := as.character(Gene)]
df[, Gene := paste0(Gene, collapse = ", "), by = Gene_Ontology_ID]
setcolorder(df, c("Gene_Ontology_ID", "Gene"))
df <- unique(df)
You will get:
Gene_Ontology_ID Gene
1: GO1 Gene1
2: GO2 Gene1, Gene2
3: GO4 Gene1, Gene2
4: GO6 Gene1
5: GO3 Gene2
6: GO5 Gene3
7: GO7 Gene3
Answer 3 (score: 0)
Here is a solution using only base R (possibly the most efficient):
# Split each row's ID string once into a list of ID vectors
ids_per_gene <- strsplit(as.character(df$Gene_Ontology_ID), ", ", fixed = TRUE)
# Obtain a vector of unique "gene ontology ids"
all_genes_id <- unique(unlist(ids_per_gene))
# Initialize and fill vector of genes per each "gene ontology id"
genes_per_id <- vector(mode = "character", length(all_genes_id))
for (i in seq_along(all_genes_id)) {
  # Exact match on whole IDs, so that "GO1" does not also match "GO10"
  rows_df <- vapply(ids_per_gene, function(ids) all_genes_id[i] %in% ids, logical(1))
  genes_per_id[i] <- paste0(df$Gene[rows_df], collapse = ",")
}
# New data frame
df2 <- data.frame(Gene_Ontology_ID = all_genes_id,
                  Gene = genes_per_id)
df2
# Result
Gene_Ontology_ID Gene
1 GO1 Gene1
2 GO2 Gene1,Gene2
3 GO4 Gene1,Gene2
4 GO6 Gene1
5 GO3 Gene2
6 GO5 Gene3
7 GO7 Gene3
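Since the question also mentions Python, the same inversion can be sketched with only the standard library. This is a minimal sketch, not a tuned implementation. It assumes each data line holds a gene name, then whitespace, then a comma-separated ID list, as in the sample, and that the header line has already been skipped. The function name invert_gene_table and the variable names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def invert_gene_table(lines):
    """Map each Gene Ontology ID to the list of genes annotated with it.

    Assumes each line looks like "Gene1 GO1, GO2, GO4" (gene, whitespace,
    comma-separated IDs) and that an ID is not repeated within one line.
    """
    go_to_genes = defaultdict(list)
    for line in lines:
        # Split off the gene name; everything after it is the ID list.
        parts = line.strip().split(None, 1)
        if len(parts) < 2:
            continue  # skip blank lines and genes without any IDs
        gene, ids = parts
        for go_id in ids.split(","):
            go_id = go_id.strip()
            if go_id:
                go_to_genes[go_id].append(gene)
    return dict(go_to_genes)


# Sample rows from the question (header line omitted).
rows = [
    "Gene1 GO1, GO2, GO4, GO6",
    "Gene2 GO2, GO3, GO4",
    "Gene3 GO5, GO7",
]

inverted = invert_gene_table(rows)
for go_id in sorted(inverted):
    print(go_id, ",".join(inverted[go_id]))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, so a large table is read in a single pass and only the inverted mapping is held in memory.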