I hope you can help me with this problem. I obtained the following data frame from the StarkLab collection of enhancer reporter constructs.
Lines <- data.frame(VTID = c("VT0006", "VT0007", "VT0112") ,
Chr = c("chr2L", "chr3R", "chr3L"),
pattern = c("ubitquitous;4", "procephalic_ectoderm_AISN;4|posterior;3", "dorsal ectoderm anlage; 4|posterior_endoderm_AISN;2|posterior_endoderm_AISN;2" ))
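From the column pattern, I would like to get the string with the highest value after the ";". I don't know how to do this, because some cells are also split by "|" into two entries, and others into three. If you have a better suggestion for how to categorise all the strings, I am open to it. How would you solve it?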
Answer 0: (score: 0)
You can use regmatches to capture the strings:
sapply(regmatches(Lines$pattern,gregexpr(".*?\\d",Lines$pattern)),function(x)x[which.max(sub("\\D*","",x))])
[1] "ubitquitous;4" "procephalic_ectoderm_AISN;4" "dorsal ectoderm anlage; 4"
You can also use the strsplit function suggested in the comments:
sapply(strsplit(as.character(Lines$pattern),"\\|"),function(x)x[which.max(sub("\\D*","",x))])
[1] "ubitquitous;4" "procephalic_ectoderm_AISN;4" "dorsal ectoderm anlage; 4"
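Both one-liners above rely on which.max coercing the extracted strings to numbers on the fly. A slightly more explicit sketch, assuming every entry has the form key;value with a numeric value, splits on "|" and converts the values with as.numeric before comparing them. The helper name pick_highest is hypothetical and not part of the original answer.

```r
# Hypothetical helper: return the "key;value" entry with the largest value.
# Assumes each "|"-separated entry ends in ";<number>".
pick_highest <- function(pattern) {
  entries <- strsplit(as.character(pattern), "|", fixed = TRUE)
  vapply(entries, function(x) {
    # Keep the text after the last ';' and convert it to a number
    values <- as.numeric(trimws(sub(".*;", "", x)))
    # which.max returns the first maximum, so ties resolve to the earliest entry
    x[which.max(values)]
  }, character(1))
}

pattern <- c("ubitquitous;4",
             "procephalic_ectoderm_AISN;4|posterior;3",
             "dorsal ectoderm anlage; 4|posterior_endoderm_AISN;2|posterior_endoderm_AISN;2")
result <- pick_highest(pattern)
result
```

Because the values are compared as numbers, an entry such as "b;12" would rank above "a;3" here.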
Answer 1: (score: 0)
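# Make sure pattern is character (it is a factor when stringsAsFactors = TRUE)
Lines$pattern <- as.character(Lines$pattern)
# Replace every run of non-digits with a comma, leaving only the numbers
Lines$values <- gsub("\\D+", ",", Lines$pattern)
get_highest <- function(x) {
  split <- strsplit(x, ",")
  # Drop the empty pieces and take the numeric maximum for each row
  unlist(lapply(split, function(s) max(as.numeric(s[s != ""]))))
}
Lines$max <- get_highest(Lines$values)
# Capture everything up to the last occurrence of the maximum value
Lines$regex <- paste0("(", paste0(".*", Lines$max), ").*")
Lines$final <- apply(Lines, 1, function(x) gsub(x["regex"], "\\1", x["pattern"]))
Lines$final
# "ubitquitous;4" "procephalic_ectoderm_AISN;4" "dorsal ectoderm anlage; 4"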
Answer 2: (score: 0)
While you can extract the information directly with regular expressions, a more robust approach is to tidy the data:
library(tidyverse)
Lines <- data.frame(VTID = c("VT0006", "VT0007", "VT0112") ,
Chr = c("chr2L", "chr3R", "chr3L"),
pattern = c("ubitquitous;4", "procephalic_ectoderm_AISN;4|posterior;3", "dorsal ectoderm anlage; 4|posterior_endoderm_AISN;2|posterior_endoderm_AISN;2" ),
stringsAsFactors = FALSE)
df_lines <- Lines %>%
separate_rows(pattern, sep = '\\|') %>%
separate(pattern, c('key', 'value'), sep = ';', convert = TRUE)
df_lines
#>     VTID   Chr                       key value
#> 1 VT0006 chr2L               ubitquitous     4
#> 2 VT0007 chr3R procephalic_ectoderm_AISN     4
#> 3 VT0007 chr3R                 posterior     3
#> 4 VT0112 chr3L    dorsal ectoderm anlage     4
#> 5 VT0112 chr3L   posterior_endoderm_AISN     2
#> 6 VT0112 chr3L   posterior_endoderm_AISN     2
After that, subsetting or aggregating is trivial:
df_lines %>% group_by(VTID, Chr) %>% top_n(1, value)
#> # A tibble: 3 x 4
#> # Groups:   VTID, Chr [3]
#>   VTID   Chr   key                       value
#>   <chr>  <chr> <chr>                     <int>
#> 1 VT0006 chr2L ubitquitous                   4
#> 2 VT0007 chr3R procephalic_ectoderm_AISN     4
#> 3 VT0112 chr3L dorsal ectoderm anlage        4
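If you would rather avoid the tidyverse dependency for this last step, the same grouped "keep the top row" subset can be sketched in base R with ave. This is an alternative technique, not the method used above, and the long-format data frame is rebuilt by hand here only so that the snippet runs on its own.

```r
# Long-format data as shown above, rebuilt by hand so the sketch is self-contained
df_lines <- data.frame(
  VTID  = c("VT0006", "VT0007", "VT0007", "VT0112", "VT0112", "VT0112"),
  Chr   = c("chr2L", "chr3R", "chr3R", "chr3L", "chr3L", "chr3L"),
  key   = c("ubitquitous", "procephalic_ectoderm_AISN", "posterior",
            "dorsal ectoderm anlage", "posterior_endoderm_AISN",
            "posterior_endoderm_AISN"),
  value = c(4L, 4L, 3L, 4L, 2L, 2L),
  stringsAsFactors = FALSE
)

# For every row, compute the maximum value within its VTID/Chr group.
# A single pasted key is used so that no empty group combinations arise.
group_max <- ave(df_lines$value, paste(df_lines$VTID, df_lines$Chr), FUN = max)

# Keep the rows that reach their group maximum (ties are all kept, as with top_n)
top_rows <- df_lines[df_lines$value == group_max, ]
top_rows
```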
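If you would rather not split the data across multiple rows, you can insert a list column of data frames instead. Working with them takes more effort, but the idiom sometimes makes sense. In base R,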
Lines$data <- lapply(strsplit(Lines$pattern, '\\|'), function(x){
read.csv2(text = paste(x, collapse = '\n'),
header = FALSE, col.names = c('key', 'value'), stringsAsFactors = FALSE)
})
str(Lines)
#> 'data.frame': 3 obs. of 4 variables:
#> $ VTID : chr "VT0006" "VT0007" "VT0112"
#> $ Chr : chr "chr2L" "chr3R" "chr3L"
#> $ pattern: chr "ubitquitous;4" "procephalic_ectoderm_AISN;4|posterior;3" "dorsal ectoderm anlage; 4|posterior_endoderm_AISN;2|posterior_endoderm_AISN;2"
#> $ data :List of 3
#> ..$ :'data.frame': 1 obs. of 2 variables:
#> .. ..$ key : chr "ubitquitous"
#> .. ..$ value: int 4
#> ..$ :'data.frame': 2 obs. of 2 variables:
#> .. ..$ key : chr "procephalic_ectoderm_AISN" "posterior"
#> .. ..$ value: int 4 3
#> ..$ :'data.frame': 3 obs. of 2 variables:
#> .. ..$ key : chr "dorsal ectoderm anlage" "posterior_endoderm_AISN" "posterior_endoderm_AISN"
#> .. ..$ value: int 4 2 2
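To get back to the original question from this nested layout, you can loop over the list column and pull the top entry out of each small data frame. The following is only a sketch under the same assumptions as above (every entry is key;value with a numeric value); the column names top_key and top_value are made up for illustration.

```r
Lines <- data.frame(
  VTID = c("VT0006", "VT0007", "VT0112"),
  Chr = c("chr2L", "chr3R", "chr3L"),
  pattern = c("ubitquitous;4",
              "procephalic_ectoderm_AISN;4|posterior;3",
              "dorsal ectoderm anlage; 4|posterior_endoderm_AISN;2|posterior_endoderm_AISN;2"),
  stringsAsFactors = FALSE
)

# Build the list column exactly as above: one small data frame per row
Lines$data <- lapply(strsplit(Lines$pattern, "|", fixed = TRUE), function(x) {
  read.csv2(text = paste(x, collapse = "\n"), header = FALSE,
            col.names = c("key", "value"), stringsAsFactors = FALSE)
})

# Hypothetical summary columns: the key with the largest value, and that value
Lines$top_key <- vapply(Lines$data, function(d) {
  d$key[which.max(as.numeric(d$value))]
}, character(1))
Lines$top_value <- vapply(Lines$data, function(d) {
  max(as.numeric(d$value))
}, numeric(1))

Lines[, c("VTID", "top_key", "top_value")]
```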