How to invert a column of comma-separated strings so that each unique value becomes the identifier column

Time: 2018-01-11 20:30:06

Tags: r unix text-processing

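Apologies if a similar question has been asked before; I could not find one, possibly because of how the problem is worded.

Some sample data is shown below. The first column is an identifier (gene), and the second column is a comma-separated list of descriptors (gene ontology IDs):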

Gene    Gene_Ontology_ID
Gene1   GO1, GO2, GO4, GO6
Gene2   GO2, GO3, GO4
Gene3   GO5, GO7

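Is there an efficient way to convert a large table in this format so that the "Gene_Ontology_ID" column becomes the identifier column and the "Gene" column holds the list of genes annotated with each Gene_Ontology_ID, as shown below?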

Gene_Ontology_ID    Gene
GO1                 Gene1
GO2                 Gene1,Gene2
GO3                 Gene2
GO4                 Gene1,Gene2
GO5                 Gene3
GO6                 Gene1
GO7                 Gene3

Is there a solution, preferably using Unix tools, Python, or R? Any help is much appreciated, thanks.

4 answers:

Answer 0 (score: 1)

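library(dplyr)
library(tidyr)

# Split the comma-separated IDs into up to 7 columns (raise 7 if a gene can
# carry more IDs), reshape to long format, then collapse the genes per ID.
out <- df %>%
  separate(Gene_Ontology_ID, into = paste("genes", 1:7, sep = "_"), sep = ", ", fill = "right") %>%
  gather(key, Gene_Ontology_ID, -Gene, na.rm = TRUE) %>%
  arrange(Gene_Ontology_ID, Gene) %>%
  group_by(Gene_Ontology_ID) %>%
  summarise(Gene = paste(Gene, collapse = ", "))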

out
# A tibble: 7 x 2
  Gene_Ontology_ID         Gene
             <chr>        <chr>
1              GO1        Gene1
2              GO2 Gene1, Gene2
3              GO3        Gene2
4              GO4 Gene1, Gene2
5              GO5        Gene3
6              GO6        Gene1
7              GO7        Gene3

Answer 1 (score: 0)

A similar option using strsplit and unnest:

library(dplyr)
library(tidyr)

df %>% 
  mutate(Gene_Ontology_ID = strsplit(Gene_Ontology_ID, ", ")) %>%
  unnest(Gene_Ontology_ID) %>% 
  group_by(Gene_Ontology_ID) %>% 
  summarise(Gene = paste(Gene, collapse = ", "))

# A tibble: 7 x 2
  Gene_Ontology_ID Gene        
  <chr>            <chr>       
1 GO1              Gene1       
2 GO2              Gene1, Gene2
3 GO3              Gene2       
4 GO4              Gene1, Gene2
5 GO5              Gene3       
6 GO6              Gene1       
7 GO7              Gene3      

Answer 2 (score: 0)

Using data.table together with cSplit from splitstackshape:

library(data.table)
library(splitstackshape)
df <- data.frame(Gene = c("Gene1", "Gene2", "Gene3"), Gene_Ontology_ID = c("GO1, GO2, GO4, GO6", "GO2, GO3, GO4", "GO5, GO7"))
df <- cSplit(df, 'Gene_Ontology_ID', ',', 'long', drop = FALSE)
setDT(df)
df[, Gene := as.character(Gene)]
df[, Gene := paste0(Gene, collapse = ", "), by = Gene_Ontology_ID]
setcolorder(df, c("Gene_Ontology_ID", "Gene"))
df <- unique(df)

This gives:

   Gene_Ontology_ID         Gene
1:              GO1        Gene1
2:              GO2 Gene1, Gene2
3:              GO4 Gene1, Gene2
4:              GO6        Gene1
5:              GO3        Gene2
6:              GO5        Gene3
7:              GO7        Gene3

Answer 3 (score: 0)

Here is a solution using only base R (possibly the most efficient):

# Split each row's ID string once, then collect the unique gene ontology IDs
go_lists <- strsplit(as.character(df$Gene_Ontology_ID), ", ")
all_genes_id <- unique(unlist(go_lists))

# Initialize and fill a vector of genes for each gene ontology ID
genes_per_id <- vector(mode = "character", length(all_genes_id))
for (i in seq_along(all_genes_id)) {
  # Exact match on the split IDs; a substring search such as
  # grepl("GO1", ...) would also match "GO10"
  rows_df <- vapply(go_lists, function(ids) all_genes_id[i] %in% ids, logical(1))
  genes_per_id[i] <- paste0(df$Gene[rows_df], collapse = ",")
}

# New data frame
df2 <- data.frame(Gene_Ontology_ID = all_genes_id,
                  Gene = genes_per_id)
df2
# Result
  Gene_Ontology_ID        Gene
1              GO1       Gene1
2              GO2 Gene1,Gene2
3              GO4 Gene1,Gene2
4              GO6       Gene1
5              GO3       Gene2
6              GO5       Gene3
7              GO7       Gene3
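
Since the question also asks about Python, the same inversion can be sketched with only the standard library. This is a minimal illustration, not one of the posted answers. It assumes the gene name contains no whitespace, that the first line is a header, and that IDs are comma-separated. The function name invert_gene_table is hypothetical.

```python
from collections import defaultdict


def invert_gene_table(lines):
    """Return {ontology_id: [genes]} from rows like 'Gene1<TAB>GO1, GO2'."""
    # The inner dict acts as an insertion-ordered set, so a gene is kept once per ID.
    genes_by_id = defaultdict(dict)
    rows = iter(lines)
    next(rows, None)  # skip the header row (assumed to be present)
    for line in rows:
        parts = line.strip().split(None, 1)  # gene name, then the rest of the line
        if len(parts) < 2:
            continue  # skip blank lines and genes without any IDs
        gene, id_field = parts
        for go_id in id_field.split(","):
            go_id = go_id.strip()
            if go_id:
                genes_by_id[go_id][gene] = None
    return {go_id: list(genes) for go_id, genes in genes_by_id.items()}


# Sample rows taken from the question
sample = [
    "Gene\tGene_Ontology_ID",
    "Gene1\tGO1, GO2, GO4, GO6",
    "Gene2\tGO2, GO3, GO4",
    "Gene3\tGO5, GO7",
]
inverted = invert_gene_table(sample)
for go_id in sorted(inverted):
    print(go_id, ",".join(inverted[go_id]), sep="\t")
```

Because the function only iterates over lines, an open file handle can be passed in place of the list, so a large table is read in a single pass and only the resulting mapping is held in memory.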