Grouping strings using grep

Time: 2017-05-04 08:40:35

Tags: r

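I have a set of strings that should be divided into four batches according to the last part of each string (S1, S2, etc.). For space reasons, I had to omit part of t.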

 > t
  [1] "001_A01_S1"   "001_A02_S2"   "001_A03_S3"   "001_A04_S4"   "001_A05_S5"   "001_A06_S6"   "001_A07_S49" 
  [8] "001_A08_S50"  "001_A09_S51"  "001_A10_S52"  "001_A11_S53"  "001_A12_S54"  "001_B01_S7"   "001_B02_S8"  
 [15] "001_B03_S9"   "001_B04_S10"  "001_B05_S11"  "001_B06_S12"  "001_B07_S55"  "001_B08_S56"  "001_B09_S57" 
 [22] "001_B10_S58"  "001_B11_S59"  "001_B12_S60"  "001_C01_S13"  "001_C02_S14"  "001_C03_S15"  "001_C04_S16" 
 [29] "001_C05_S17"  "001_C06_S18"  "001_C07_S61"  "001_C08_S62"  "001_C09_S63"  "001_C10_S64"  "001_C11_S65" 
 [36] "001_C12_S66"  "001_D01_S19"  "001_D02_S20"  "001_D03_S21"  "001_D04_S22"  "001_D05_S23"  "001_D06_S24" 
 [43] "001_D07_S67"  "001_D08_S68"  "001_D09_S69"  "001_D10_S70"  

I want to divide them into four batches:

Batch1: S1-S48
Batch2: S49-S96
Batch3: S97-S144
Batch4: S145-S192

This is what I tried:

batch <- y
batch[grep("S([1-9]|[1-3].|4[0-8])_", batch)] <- "B1"
batch[grep("S([5-8].|49|9[0-6])_", batch)] <- "B2"
batch[grep("S(1[0-3].|14[0-4]|9[7-9])_", batch)] <- "B3"
batch[!grepl("^B", batch)] <- "B4"

2 answers:

Answer 0: (score: 4)

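You could first extract the numeric part of the last part of each string (i.e. the 1, 2, 3 in S1, S2, S3, etc.). Then, using these numbers, you can categorize with cut.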

Example

## Some sample data:
t <- c("001_A01_S1",   "001_A02_S2",  "001_A03_S3",   
        "001_A07_S49", "001_A08_S50", "001_A09_S51",  
        "001_C01_S110", "001_C02_S114", "001_C02_S128",
        "001_C01_S155", "001_C02_S159", "001_C02_S162") 
## Extract numeric part of "SXXX"
sNumericVec <-  as.numeric(stringr::str_extract(t, "(?<=_S)[[:digit:]]*"))
## Categorize:
catVec <- cut(sNumericVec, breaks = c(0,48,96,144,192))
## Rename levels:
levels(catVec) <- paste0("B", 1:4) 

catVec
#  [1] B1 B1 B1 B2 B2 B2 B3 B3 B3 B4 B4 B4
# Levels: B1 B2 B3 B4
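
As a quick sanity check on the breakpoints, the sketch below runs the boundary numbers of each batch from the question through the same cut call. It relies on cut using right-closed intervals by default, i.e. (0,48], (48,96], and so on, so 48 should land in B1 and 49 in B2. The names boundaryVec and boundaryCat are illustrative and not part of the original answer.

```r
## First and last S-number of each batch, taken from the ranges in the question
boundaryVec <- c(1, 48, 49, 96, 97, 144, 145, 192)
## cut() defaults to right = TRUE, so the intervals are (0,48], (48,96], (96,144], (144,192]
boundaryCat <- cut(boundaryVec, breaks = c(0, 48, 96, 144, 192),
                   labels = paste0("B", 1:4))
as.character(boundaryCat)
# [1] "B1" "B1" "B2" "B2" "B3" "B3" "B4" "B4"
```

Any number outside 1-192 would come back as NA rather than being assigned to a batch.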

Answer 1: (score: 3)

You can do this using cut:
batch <- cut(as.numeric(gsub(".+S(\\d+)$","\\1",t)), #identify last numeric code
             c(0,48,96,144,192), #breakpoints for cut
             labels = c("B1","B2","B3","B4")) #names of batches
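
Because all four batches in the question have the same width (48), the batch index can also be computed arithmetically instead of with cut. This is only a sketch under that equal-width assumption; it would not carry over to uneven ranges. The names sampleIds, sNum and batchArith are illustrative and do not come from the original answers.

```r
## A few sample IDs in the same format as the question's data
sampleIds <- c("001_A01_S1", "001_A07_S49", "001_C01_S110", "001_C01_S155")
## Pull out the digits after the final "_S"
sNum <- as.numeric(sub(".*_S(\\d+)$", "\\1", sampleIds))
## Equal-width batches of 48: 1-48 -> 1, 49-96 -> 2, 97-144 -> 3, 145-192 -> 4
batchArith <- paste0("B", ceiling(sNum / 48))
batchArith
# [1] "B1" "B2" "B3" "B4"
```

Unlike cut, this returns a character vector rather than a factor, and it does not flag numbers above 192 as NA.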