Extract every element matching a pattern from a string in R

Time: 2018-12-11 21:50:37

Tags: r replace data-manipulation text-extraction

I have a string that is basically a SQL statement, and I want to extract parts of it. Here is the code:

 SELECT 
 DTE as "Date",
 CURRENT_DATE AS "Day",
 concat( BCCO, BCBCH ) AS "client/batch",
 BCSTAT as "Batch Status",
 CASE 
  WHEN EXC = 'MCR' THEN CNT 
  ELSE 0 
 END AS "MCR-NPR",
 CASE 
  WHEN EXC = 'NRC' THEN CNT 
  ELSE 0 
 END AS "NRC-NPR",
 CASE 
  WHEN EXC = 'OFD' THEN CNT 
  ELSE 0 
 END AS "OFD-NPR",
 CASE 
  WHEN EXC = 'TDB' THEN CNT 
  ELSE 0 
 END AS "TDB-NPR",
 CASE 
  WHEN EXC = 'TDC' THEN CNT 
  ELSE 0 
 END AS "TDC-NPR",
 CASE 
  WHEN EXC = 'UDC' THEN CNT 
  ELSE 0 
 END AS "UDC-NPR",
 CASE 
  WHEN EXC = 'BIN' THEN CNT 
  ELSE 0 
 END AS "BIN-WRN",
 CASE 
  WHEN EXC = 'DSP' THEN CNT 
  ELSE 0 
 END AS "DSP-WRN",

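I want to extract every element that sits between END AS and the quotation marks. A vector like ("MCR-NPR", ..., "DSP-WRN") would be the desired output.

I know I probably need a regular expression, but I have not managed to extract every one of them.

Any ideas would be appreciated.

Best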

1 answer:

Answer 0: (score: 2)

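1) grep / read.table Use grep to pick out the lines containing END AS, then read them with read.table using the double quote as sep. The second column holds the desired data. This uses no regular expressions (the grep call passes fixed = TRUE) and no packages.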

read.table(text = grep("END AS", s, value = TRUE, fixed = TRUE), 
  sep = '"', as.is = TRUE)[[2]]
## [1] "MCR-NPR" "NRC-NPR" "OFD-NPR" "TDB-NPR" "TDC-NPR" "UDC-NPR" "BIN-WRN"
## [8] "DSP-WRN"
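
To see why the second column is the one that holds the alias, here is a minimal sketch that splits a single line on the double quote. It uses strsplit purely as an illustration of the splitting idea and is not a description of how read.table works internally; the variable names line, parts and alias are made up for this example.

```r
# One line of the input, copied from s
line <- ' END AS "MCR-NPR",'

# Split on the double quote as a fixed string (no regular expression)
parts <- strsplit(line, '"', fixed = TRUE)[[1]]

# parts[1] is the text before the opening quote, parts[2] is the alias,
# parts[3] is whatever follows the closing quote
alias <- parts[2]
print(alias)
```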

1a) This is similar to (1) but uses sub with a regular expression instead of read.table.

sub('.*END AS "(.+)".*', "\\1", grep("END AS", s, value = TRUE))
## [1] "MCR-NPR" "NRC-NPR" "OFD-NPR" "TDB-NPR" "TDC-NPR" "UDC-NPR" "BIN-WRN"
## [8] "DSP-WRN"
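
One caveat about the pattern: (.+) is greedy, which is harmless here because each matching line contains exactly one quoted string. The sketch below uses a hypothetical line with two quoted strings (it does not occur in the input above) to show the difference when the capture group is restricted to non-quote characters; the names x, greedy and tight are made up for this example.

```r
# Hypothetical line with two quoted strings; not part of the original input
x <- ' END AS "MCR-NPR", FOO AS "other",'

# Greedy capture: runs up to the last double quote on the line
greedy <- sub('.*END AS "(.+)".*', "\\1", x)

# Restricted capture: stops at the first closing double quote
tight <- sub('.*END AS "([^"]+)".*', "\\1", x)

print(greedy)
print(tight)
```

On the input shown in this answer both patterns should give the same result, so the restricted form is only a defensive variant.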

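2) strapplyc Another approach is the following. It relies on the fact that the desired strings follow END AS and are enclosed in double quotes, and it has the shortest code of those shown here.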

library(gsubfn)
unlist(strapplyc(s, 'END AS "(.+)"'))
## [1] "MCR-NPR" "NRC-NPR" "OFD-NPR" "TDB-NPR" "TDC-NPR" "UDC-NPR" "BIN-WRN"
## [8] "DSP-WRN"

3) strcapture Another base R approach, using the same pattern as in (2), is:

na.omit(strcapture('END AS "(.+)"', s, list(value = character(0))))

giving:

     value
9  MCR-NPR
13 NRC-NPR
17 OFD-NPR
21 TDB-NPR
25 TDC-NPR
29 UDC-NPR
33 BIN-WRN
37 DSP-WRN
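
The result above is a data frame, whereas the question asks for a vector. A minimal sketch of one way to get a plain character vector from the same strcapture call follows; s_short is a shortened stand-in for s defined only for this example, and res and vals are likewise made-up names.

```r
# Shortened stand-in for the input s (first two CASE blocks only)
s_short <- c(" CASE ", "  WHEN EXC = 'MCR' THEN CNT ", "  ELSE 0 ",
             " END AS \"MCR-NPR\",",
             " CASE ", "  WHEN EXC = 'NRC' THEN CNT ", "  ELSE 0 ",
             " END AS \"NRC-NPR\",")

# One row per input line; lines that do not match give NA
res <- strcapture('END AS "(.+)"', s_short, list(value = character(0)))

# Keep the captured column as a character vector and drop the NA entries
vals <- res$value[!is.na(res$value)]
print(vals)
```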

Note

The input s in reproducible form:

s <- 
c("SELECT ", " DTE as \"Date\",", " CURRENT_DATE AS \"Day\",", 
" concat( BCCO, BCBCH ) AS \"client/batch\",", " BCSTAT as \"Batch Status\",", 
" CASE ", "  WHEN EXC = 'MCR' THEN CNT ", "  ELSE 0 ", " END AS \"MCR-NPR\",", 
" CASE ", "  WHEN EXC = 'NRC' THEN CNT ", "  ELSE 0 ", " END AS \"NRC-NPR\",", 
" CASE ", "  WHEN EXC = 'OFD' THEN CNT ", "  ELSE 0 ", " END AS \"OFD-NPR\",", 
" CASE ", "  WHEN EXC = 'TDB' THEN CNT ", "  ELSE 0 ", " END AS \"TDB-NPR\",", 
" CASE ", "  WHEN EXC = 'TDC' THEN CNT ", "  ELSE 0 ", " END AS \"TDC-NPR\",", 
" CASE ", "  WHEN EXC = 'UDC' THEN CNT ", "  ELSE 0 ", " END AS \"UDC-NPR\",", 
" CASE ", "  WHEN EXC = 'BIN' THEN CNT ", "  ELSE 0 ", " END AS \"BIN-WRN\",", 
" CASE ", "  WHEN EXC = 'DSP' THEN CNT ", "  ELSE 0 ", " END AS \"DSP-WRN\"")