I currently have a data frame that looks like this:
SampleID Chrom Start End ID
HSB275 chr1 243216377 243219494 ENST00000366542|ENSG00000143702|protein_coding|protein_coding,chr1,243216377,243219494;ENST00000366543|ENSG00000143702|protein_coding|protein_coding,chr1,243216377,243219494
HSB274 chr10 952208 979839 ENST00000381466|ENSG00000205740|antisense|processed_transcript,chr10,971146,979839
HSB272 chr10 1046378 1047984 ENST00000381344|ENSG00000067064|protein_coding|protein_coding,chr10,1046378,1047984;ENST00000491735|ENSG00000067064|processed_transcript|protein_coding,chr10,1046378,1047984;ENST00000427898|ENSG00000067064|protein_coding|protein_coding,chr10,1046378,1047984
HSB481 chr11 654157 655184 ENST00000527170|ENSG00000177030|nonsense_mediated_decay|protein_coding,chr11,654157,655184
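What I want to do is reduce the ID column to a list of the "ENSGXXXXXXX" values, delimited by "," when a row contains more than one, so that it looks like the Genes column:
Desired result:
Genes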
Answer 0 (score: 2)
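You do not have a single fixed delimiter, but you can use strsplit to split the ID column on all of the delimiters (",", ";" and "|"). Then, for each element, keep only the values that start with "ENSG" and drop the rest.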
# as.character() guards against ID having been read in as a factor
sapply(strsplit(as.character(df$ID), ",|\\||;"),
       function(x) toString(grep("^ENSG", x, value = TRUE)))
#[1] "ENSG00000143702, ENSG00000143702"
#[2] "ENSG00000205740"
#[3] "ENSG00000067064, ENSG00000067064, ENSG00000067064"
#[4] "ENSG00000177030"
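To see what the two steps do, here is a minimal sketch on a single ID value copied from the second row of the question's data. The variable names `id`, `parts` and `genes` are only illustrative and are not part of the answer above.

```r
# One ID value copied from the second row of the question's data
id <- "ENST00000381466|ENSG00000205740|antisense|processed_transcript,chr10,971146,979839"

# Step 1: split on any of the three delimiters: ",", "|" or ";"
parts <- strsplit(id, ",|\\||;")[[1]]
parts

# Step 2: keep only the pieces that start with "ENSG" and join them
genes <- grep("^ENSG", parts, value = TRUE)
toString(genes)
```

Because the pattern is anchored with "^", pieces such as the ENST transcript IDs are dropped even though they look similar.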
Answer 1 (score: 0)
Here is a tidyverse option:
library(tidyverse)
df %>%
mutate(Genes = map_chr(str_split(ID, ";"), ~toString(map(str_split(.x, "\\|"), 2)))) %>%
select(-ID)
# SampleID Chrom Start End
#1 HSB275 chr1 243216377 243219494
#2 HSB274 chr10 952208 979839
#3 HSB272 chr10 1046378 1047984
#4 HSB481 chr11 654157 655184
# Genes
#1 ENSG00000143702, ENSG00000143702
#2 ENSG00000205740
#3 ENSG00000067064, ENSG00000067064, ENSG00000067064
#4 ENSG00000177030
Sample data:
df <- read.table(text =
"SampleID Chrom Start End ID
HSB275 chr1 243216377 243219494 ENST00000366542|ENSG00000143702|protein_coding|protein_coding,chr1,243216377,243219494;ENST00000366543|ENSG00000143702|protein_coding|protein_coding,chr1,243216377,243219494
HSB274 chr10 952208 979839 ENST00000381466|ENSG00000205740|antisense|processed_transcript,chr10,971146,979839
HSB272 chr10 1046378 1047984 ENST00000381344|ENSG00000067064|protein_coding|protein_coding,chr10,1046378,1047984;ENST00000491735|ENSG00000067064|processed_transcript|protein_coding,chr10,1046378,1047984;ENST00000427898|ENSG00000067064|protein_coding|protein_coding,chr10,1046378,1047984
HSB481 chr11 654157 655184 ENST00000527170|ENSG00000177030|nonsense_mediated_decay|protein_coding,chr11,654157,655184", header = T)
Answer 2 (score: 0)
library(dplyr)
library(stringr) #str_extract_all
df %>% group_by(SampleID) %>% #Use rowwise() if you do not like group_by
mutate(Genes = paste(str_extract_all(ID, 'ENSG\\d+',simplify = T),collapse = ',')) %>%
select(-ID)
# A tibble: 4 x 5
# Groups: SampleID [4]
SampleID Chrom Start End Genes
<fct> <fct> <int> <int> <chr>
1 HSB275 chr1 243216377 243219494 ENSG00000143702,ENSG00000143702
2 HSB274 chr10 952208 979839 ENSG00000205740
3 HSB272 chr10 1046378 1047984 ENSG00000067064,ENSG00000067064,ENSG00000067064
4 HSB481 chr11 654157 655184 ENSG00000177030
Answer 3 (score: 0)
My attempt:
genes %>%
mutate_at(vars(ID), funs(str_extract_all(., "ENSG[:digit:]*") %>%
str_replace_all("c|\"|\\(|\\)", "")))
# A tibble: 4 x 5
SampleID Chrom Start End ID
<chr> <chr> <dbl> <dbl> <chr>
1 HSB275 chr1 243216377 243219494 ENSG00000143702, ENSG00000143702
2 HSB274 chr10 952208 979839 ENSG00000205740
3 HSB272 chr10 1046378 1047984 ENSG00000067064, ENSG00000067064, ENSG00000067064
4 HSB481 chr11 654157 655184 ENSG00000177030
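This finds every match of the pattern ENSG<any number of digits>, then coerces the resulting list to a vector of strings and strips out the unwanted characters.
Personally, though, in the spirit of tidy data, I would put each ID in its own row, duplicating the relevant SampleID / Chrom / Start / End data.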
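The one-row-per-ID layout mentioned above could be sketched as follows. This is only an illustration in base R, not the answerer's code: it uses two rows copied from the question's data, and the names `ids`, `long` and `Gene` are assumptions introduced here.

```r
# Two rows copied from the question's data; ID is kept as character
df <- data.frame(
  SampleID = c("HSB275", "HSB274"),
  Chrom    = c("chr1", "chr10"),
  Start    = c(243216377, 952208),
  End      = c(243219494, 979839),
  ID       = c(
    "ENST00000366542|ENSG00000143702|protein_coding|protein_coding,chr1,243216377,243219494;ENST00000366543|ENSG00000143702|protein_coding|protein_coding,chr1,243216377,243219494",
    "ENST00000381466|ENSG00000205740|antisense|processed_transcript,chr10,971146,979839"
  ),
  stringsAsFactors = FALSE
)

# All ENSG identifiers per row, as a list of character vectors
ids <- regmatches(df$ID, gregexpr("ENSG[0-9]+", df$ID))

# Repeat each row once per identifier it contains, then attach the identifiers
long <- df[rep(seq_len(nrow(df)), lengths(ids)), c("SampleID", "Chrom", "Start", "End")]
long$Gene <- unlist(ids)
rownames(long) <- NULL
long
```

With this layout each gene ID can be filtered, counted or joined against another table directly, at the cost of repeating the sample coordinates.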
Answer 4 (score: 0)
Data
df <- read.table(text =
"SampleID Chrom Start End ID
HSB275 chr1 243216377 243219494 ENST00000366542|ENSG00000143702|protein_coding|protein_coding,chr1,243216377,243219494;ENST00000366543|ENSG00000143702|protein_coding|protein_coding,chr1,243216377,243219494
HSB274 chr10 952208 979839 ENST00000381466|ENSG00000205740|antisense|processed_transcript,chr10,971146,979839
HSB272 chr10 1046378 1047984 ENST00000381344|ENSG00000067064|protein_coding|protein_coding,chr10,1046378,1047984;ENST00000491735|ENSG00000067064|processed_transcript|protein_coding,chr10,1046378,1047984;ENST00000427898|ENSG00000067064|protein_coding|protein_coding,chr10,1046378,1047984
HSB481 chr11 654157 655184 ENST00000527170|ENSG00000177030|nonsense_mediated_decay|protein_coding,chr11,654157,655184", header = T)
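My solution
I defined two functions that may also be useful for similar tasks in the future:
The first, extract_matches, extracts all matches from each element of str.vec. It returns a list containing every substring that matches the pattern. It wraps gregexpr, which on its own only returns positional information about the matches.
The second, extract_matches_aggregating, always returns a vector, because it joins all matches found in an element using sep=. It depends on extract_matches.
You can use these two functions to extract all ENSG IDs and join them with ", ".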
extract_matches <- function(pattern, str.vec) {
Map(function(m, s) substring(s, m, m + attr(m, "match.length") - 1), gregexpr(pattern, str.vec), str.vec)
}
extract_matches_aggregating <- function(pattern, str.vec, sep = "; ") {
sapply(extract_matches(pattern, str.vec), function(res_vec) {
paste(res_vec, collapse = sep)})
}
df$ID <- extract_matches_aggregating(pattern = "ENSG\\d+", str.vec = df$ID, sep = ", ")
df
The result is:
SampleID Chrom Start End
1 HSB275 chr1 243216377 243219494
2 HSB274 chr10 952208 979839
3 HSB272 chr10 1046378 1047984
4 HSB481 chr11 654157 655184
ID
1 ENSG00000143702, ENSG00000143702
2 ENSG00000205740
3 ENSG00000067064, ENSG00000067064, ENSG00000067064
4 ENSG00000177030
On large tables, this solution will be faster than the solutions that use strsplit together with sapply and lapply.
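For comparison, base R also provides regmatches(), which turns the positional output of gregexpr into the matched substrings, much as extract_matches does. The sketch below is an alternative illustration rather than part of the answer above, and makes no claim about relative speed; the input is two ID values copied from the question's data, and the names `ids`, `m`, `matches` and `genes` are assumptions introduced here.

```r
# Two ID values copied from the question's data
ids <- c(
  "ENST00000381466|ENSG00000205740|antisense|processed_transcript,chr10,971146,979839",
  "ENST00000527170|ENSG00000177030|nonsense_mediated_decay|protein_coding,chr11,654157,655184"
)

# gregexpr() returns match positions and lengths only
m <- gregexpr("ENSG[0-9]+", ids)

# regmatches() uses those positions to pull out the matching substrings
matches <- regmatches(ids, m)

# Join the matches of each element into one string, like extract_matches_aggregating
genes <- vapply(matches, paste, character(1), collapse = ", ")
genes
```

Using vapply with character(1) makes the result a plain character vector with exactly one string per input element.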