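Splitting a table column and subsetting

Time: 2018-01-25 19:35:00

Tags: r string subset

I hope you can help me with this problem. I obtained the following data frame from the StarkLab collection of enhancer construct reporters.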

Lines <- data.frame(VTID = c("VT0006", "VT0007", "VT0112") , 
                    Chr = c("chr2L", "chr3R", "chr3L"), 
                    pattern = c("ubitquitous;4", "procephalic_ectoderm_AISN;4|posterior;3", "dorsal ectoderm anlage; 4|posterior_endoderm_AISN;2|posterior_endoderm_AISN;2"))

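From the column pattern, I want to get the string with the highest value after the ;. I don't know how to do this, because the strings are also separated by |: some contain two entries, others contain three. If you have a better suggestion for how to sort all the strings, I am open to it. How would you solve this?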

3 answers:

Answer 0: (score: 0)

You can capture the strings with regmatches:

 sapply(regmatches(Lines$pattern,gregexpr(".*?\\d",Lines$pattern)),function(x)x[which.max(sub("\\D*","",x))])
[1] "ubitquitous;4"               "procephalic_ectoderm_AISN;4" "dorsal ectoderm anlage; 4"

You can also use the strsplit function suggested in the comments:

sapply(strsplit(as.character(Lines$pattern),"\\|"),function(x)x[which.max(sub("\\D*","",x))])
[1] "ubitquitous;4"               "procephalic_ectoderm_AISN;4" "dorsal ectoderm anlage; 4"  
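
Both one-liners above strip leading non-digit characters and rely on which.max coercing the remaining text to a number. A slightly more explicit variant is sketched below. It assumes that every entry has the form key;value, that the value is whatever follows the last ; in the entry, and that no string is empty. The helper name top_entry and the third test string are hypothetical and are not part of the original answer.

```r
# Hypothetical helper (not from the original answers): for each string,
# return the "|"-separated entry whose value after ";" is largest.
top_entry <- function(pattern) {
  # fixed = TRUE treats "|" literally, so no regex escaping is needed
  entries <- strsplit(as.character(pattern), "|", fixed = TRUE)
  vapply(entries, function(x) {
    # Keep only the text after the last ";" and parse it as a number;
    # as.numeric() tolerates surrounding whitespace, as in "; 4"
    value <- as.numeric(sub(".*;", "", x))
    # On ties, which.max() returns the first maximum
    x[which.max(value)]
  }, character(1))
}

patterns <- c("ubitquitous;4",
              "procephalic_ectoderm_AISN;4|posterior;3",
              "a;2|b;10")  # made-up string with a multi-digit value
result <- top_entry(patterns)
```

Because the value is parsed with as.numeric rather than matched as a single digit, multi-digit values such as 10 should be compared numerically.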

Answer 1: (score: 0)

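# Work on character vectors rather than factors
Lines$pattern <- as.character(Lines$pattern)
# Replace every run of non-digits with a comma, leaving only the numbers
Lines$values <- gsub("\\D+", ",", Lines$pattern)

# Return the largest number found in each comma-separated string
get_highest <- function(x) {
  split <- strsplit(x, ",")
  split <- unlist(lapply(split, function(s) max(as.numeric(s[s != ""]))))
  return(split)
}

Lines$max <- get_highest(Lines$values)
# Build one regex per row that captures everything up to the maximum value
Lines$regex <- paste0("(", paste0(".*", Lines$max), ").*")
# Keep only the captured part of each pattern; columns are referenced by name
Lines$final <- apply(Lines, 1, function(x) gsub(x["regex"], "\\1", x["pattern"]))

Lines$final 
# "ubitquitous;4"               "procephalic_ectoderm_AISN;4" "dorsal ectoderm anlage; 4" 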

Answer 2: (score: 0)

While you can extract the information directly with a regular expression, a more robust technique is to tidy the data:

library(tidyverse)

Lines <- data.frame(VTID = c("VT0006", "VT0007", "VT0112") , 
                    Chr = c("chr2L", "chr3R", "chr3L"), 
                    pattern = c("ubitquitous;4", "procephalic_ectoderm_AISN;4|posterior;3", "dorsal ectoderm anlage; 4|posterior_endoderm_AISN;2|posterior_endoderm_AISN;2" ),
                    stringsAsFactors = FALSE)

df_lines <- Lines %>% 
    separate_rows(pattern, sep = '\\|') %>% 
    separate(pattern, c('key', 'value'), sep = ';', convert = TRUE)

df_lines
#>     VTID   Chr                       key value
#> 1 VT0006 chr2L               ubitquitous     4
#> 2 VT0007 chr3R procephalic_ectoderm_AISN     4
#> 3 VT0007 chr3R                 posterior     3
#> 4 VT0112 chr3L    dorsal ectoderm anlage     4
#> 5 VT0112 chr3L   posterior_endoderm_AISN     2
#> 6 VT0112 chr3L   posterior_endoderm_AISN     2

After that, subsetting or aggregating is trivial:

df_lines %>% group_by(VTID, Chr) %>% top_n(1, value)
#> # A tibble: 3 x 4
#> # Groups:   VTID, Chr [3]
#>   VTID   Chr   key                       value
#>   <chr>  <chr> <chr>                     <int>
#> 1 VT0006 chr2L ubitquitous                   4
#> 2 VT0007 chr3R procephalic_ectoderm_AISN     4
#> 3 VT0112 chr3L dorsal ectoderm anlage        4
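
In dplyr 1.0.0 and later, top_n is superseded by slice_max. The sketch below shows the equivalent call under that assumption. The tidied data frame is rebuilt by hand here only so that the example is self-contained, and the name top_rows is hypothetical. Setting with_ties = FALSE keeps exactly one row per group; leave it at its default to keep ties, as top_n does.

```r
library(dplyr)

# The tidied data from above, rebuilt by hand so the sketch runs on its own
df_lines <- data.frame(
  VTID  = c("VT0006", "VT0007", "VT0007", "VT0112", "VT0112", "VT0112"),
  Chr   = c("chr2L", "chr3R", "chr3R", "chr3L", "chr3L", "chr3L"),
  key   = c("ubitquitous", "procephalic_ectoderm_AISN", "posterior",
            "dorsal ectoderm anlage", "posterior_endoderm_AISN",
            "posterior_endoderm_AISN"),
  value = c(4L, 4L, 3L, 4L, 2L, 2L),
  stringsAsFactors = FALSE
)

# One row per VTID/Chr group: the row with the largest value
top_rows <- df_lines %>%
  group_by(VTID, Chr) %>%
  slice_max(value, n = 1, with_ties = FALSE) %>%
  ungroup()
```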

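If you don't want to split the data into multiple rows, you can instead add a list column of data frames. List columns take more work to handle, but sometimes the idiom makes sense. In base R,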

Lines$data <- lapply(strsplit(Lines$pattern, '\\|'), function(x){
    read.csv2(text = paste(x, collapse = '\n'), 
              header = FALSE, col.names = c('key', 'value'), stringsAsFactors = FALSE)
})

str(Lines)
#> 'data.frame':    3 obs. of  4 variables:
#>  $ VTID   : chr  "VT0006" "VT0007" "VT0112"
#>  $ Chr    : chr  "chr2L" "chr3R" "chr3L"
#>  $ pattern: chr  "ubitquitous;4" "procephalic_ectoderm_AISN;4|posterior;3" "dorsal ectoderm anlage; 4|posterior_endoderm_AISN;2|posterior_endoderm_AISN;2"
#>  $ data   :List of 3
#>   ..$ :'data.frame': 1 obs. of  2 variables:
#>   .. ..$ key  : chr "ubitquitous"
#>   .. ..$ value: int 4
#>   ..$ :'data.frame': 2 obs. of  2 variables:
#>   .. ..$ key  : chr  "procephalic_ectoderm_AISN" "posterior"
#>   .. ..$ value: int  4 3
#>   ..$ :'data.frame': 3 obs. of  2 variables:
#>   .. ..$ key  : chr  "dorsal ectoderm anlage" "posterior_endoderm_AISN" "posterior_endoderm_AISN"
#>   .. ..$ value: int  4 2 2
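
To get the top entry back out of such a list column, one option is to apply a small function to each nested data frame. The sketch below rebuilds the list column as above and then adds two ordinary columns; the names top_key and top_value are hypothetical and chosen only for illustration.

```r
# Rebuild the example data and the list column so the sketch runs on its own
Lines <- data.frame(
  VTID = c("VT0006", "VT0007", "VT0112"),
  Chr = c("chr2L", "chr3R", "chr3L"),
  pattern = c("ubitquitous;4",
              "procephalic_ectoderm_AISN;4|posterior;3",
              "dorsal ectoderm anlage; 4|posterior_endoderm_AISN;2|posterior_endoderm_AISN;2"),
  stringsAsFactors = FALSE
)

Lines$data <- lapply(strsplit(Lines$pattern, "|", fixed = TRUE), function(x) {
  read.csv2(text = paste(x, collapse = "\n"),
            header = FALSE, col.names = c("key", "value"),
            stringsAsFactors = FALSE)
})

# For each nested data frame, pull out the key with the largest value
# (which.max() returns the first maximum on ties) and the value itself
Lines$top_key <- vapply(Lines$data,
                        function(d) d$key[which.max(d$value)],
                        character(1))
Lines$top_value <- vapply(Lines$data,
                          function(d) max(d$value),
                          numeric(1))
```

The nested data frames stay available in Lines$data, so other summaries can be computed later without splitting the strings again.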