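Extract text from the innermost nested parentheses of a string

Date: 2017-06-16 15:55:28

Tags: r regex string-substitution

From the character vector below, I am trying to extract a specific substring from each element.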

string <- c("(Intercept)", "scale(AspectCos_30)", "scale(CanCov_500)", 
            "scale(DST50_30)", "scale(Ele_30)", "scale(NDVI_Tin_250)", "scale(Slope_500)", 
            "I(scale(Slope_500)^2)", "scale(SlopeVar_30)", "scale(CanCov_1000)", 
            "scale(NDVI_Tin_1000)", "scale(Slope_1000)", "I(scale(Slope_1000)^2)", 
            "scale(log(SlopeVar_30 + 0.001))", "scale(CanCov_30)", "scale(Slope_30)", 
            "I(scale(Slope_30)^2)")

A good result would return the central text without any special characters, as shown below.

Good <- c("Intercept", "AspectCos", "CanCov", "DST50", "Ele", "NDVI", "Slope", "Slope",
            "SlopeVar", "CanCov", "NDVI", "Slope", "Slope", "SlopeVar", "CanCov", "Slope", "Slope")

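Preferably, however, the resulting strings would account for the ^2 and log associated with 'Slope' and 'SlopeVar' respectively. Specifically, every string containing ^2 would be converted to 'SlopeSq' and every string containing log would be converted to 'SlopeVarPs', as shown below.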

Best <- c("Intercept", "AspectCos", "CanCov", "DST50", "Ele", "NDVI", "Slope", "SlopeSq",
          "SlopeVar", "CanCov", "NDVI", "Slope", "SlopeSq", "SlopeVarPs", "CanCov", "Slope", "SlopeSq")

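I have a long, ugly, and inefficient sequence of code that gets me roughly halfway to the Good result, and I would appreciate any suggestions.

2 Answers:

Answer 0: (score: 3)

As a not-so-efficient coder, I like to chain several regular expressions to get the result (each regex is commented on its own line):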

library(stringr)
library(dplyr)
outcome <- string %>% 
  str_replace_all(".*log\\((.*?)(_.+?)?\\).*", "\\1Ps") %>% # deal with the "log" entry
  str_replace_all(".*\\((.*?\\))", "\\1") %>% # delete everything up to and including the last "("
  str_replace_all("(_\\d+)?\\)\\^2", "Sq") %>%  # take care of ^2
  str_replace_all("(_.+)?\\)?", "") # remove leftover characters at the end (e.g. "_30" and ")")


Best <- c("Intercept", "AspectCos", "CanCov", "DST50", "Ele", "NDVI", "Slope", "SlopeSq",
          "SlopeVar", "CanCov", "NDVI", "Slope", "SlopeSq", "SlopeVarPs", "CanCov","Slope", "SlopeSq")
all(outcome == Best)
## TRUE
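
To see what each link in the chain contributes, the four substitutions can be replayed one at a time on a single input. The following is only a sketch: `trace_steps` is a hypothetical helper that is not part of the answer, and it uses base R `gsub(..., perl = TRUE)` in place of `str_replace_all` so that it runs without any packages.

```r
# Hypothetical helper (not from the answer): apply the four substitutions
# one at a time and keep every intermediate string for inspection.
trace_steps <- function(x) {
  steps <- list(
    c(".*log\\((.*?)(_.+?)?\\).*", "\\1Ps"), # handle the "log" entry
    c(".*\\((.*?\\))", "\\1"),               # drop text up to the last "("
    c("(_\\d+)?\\)\\^2", "Sq"),              # turn ")^2" into "Sq"
    c("(_.+)?\\)?", "")                      # strip "_500", ")" and similar
  )
  out <- x
  for (s in steps) {
    x <- gsub(s[1], s[2], x, perl = TRUE)
    out <- c(out, x)  # record the string after this step
  }
  out
}

# First element is the input, the following four are the results of each step.
trace <- trace_steps("I(scale(Slope_500)^2)")
print(trace)
```

For this input the first step leaves the string unchanged (there is no "log"), and the remaining three steps peel it down to "SlopeSq".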

Answer 1: (score: 1)

I think this can be done with the stringr package.

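First, the "central text" you want sits inside the innermost parentheses, so the following regular expression skips any parenthesized text that itself contains parentheses. I keep the "log" / "^2" markers for later use, though.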

library(stringr)

# innermost "(...)" group, plus an optional leading "log" or trailing "^2"
string_step <- str_extract(string,
                           "(log|)\\([^()]+\\)(\\^2|)")

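As a quick check of what that pattern captures, here is a minimal sketch on three representative inputs. It swaps `str_extract` for the base R pair `regexpr` / `regmatches` (with `perl = TRUE`) so that it runs without stringr; the variable names `examples` and `innermost` are made up for illustration.

```r
# Base R stand-in for str_extract(): regexpr() locates the first match in
# each element and regmatches() returns the matched text.
pattern <- "(log|)\\([^()]+\\)(\\^2|)"
examples <- c("(Intercept)",
              "I(scale(Slope_500)^2)",
              "scale(log(SlopeVar_30 + 0.001))")

# [^()]+ cannot cross a parenthesis, so only the innermost group can match;
# the optional "log" prefix and "^2" suffix are kept when present.
innermost <- regmatches(examples, regexpr(pattern, examples, perl = TRUE))
print(innermost)
```

The outer wrappers (`I(`, `scale(`) are dropped, while the "log" and "^2" markers stay attached to the innermost group for the later suffix step.
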
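Then I noticed that everything after the underscore is dropped and only the leading alphabetic (and numeric) part is kept, so the [:alnum:] character class is used.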

GoodMy <-
  str_extract(str_replace_all(string_step, "log|\\(|\\)|\\^2", ""),
              "[:alnum:]+")

# append "Ps" for log terms and "Sq" for squared terms to the extracted names
BestMy <-
  paste0(GoodMy, as.character(sapply(string_step, function(x) {
    if (str_detect(x, "log")) {
      "Ps"
    } else if (str_detect(x, "\\^2")) {
      "Sq"
    } else {
      ""
    }
  })))

all(Good == GoodMy, Best == BestMy) # yields TRUE
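
The suffix step can also be written without an explicit loop. The sketch below is one possible vectorized variant, not the answer's own code: it uses base R `grepl(..., fixed = TRUE)` and nested `ifelse`, and the small vectors `step` and `name` are made-up stand-ins for `string_step` and the extracted names.

```r
# Made-up stand-ins for string_step and the extracted names
step <- c("(Slope_500)", "(Slope_500)^2", "log(SlopeVar_30 + 0.001)")
name <- c("Slope", "Slope", "SlopeVar")

# fixed = TRUE matches "log" and "^2" literally, so "^" needs no escaping
suffix <- ifelse(grepl("log", step, fixed = TRUE), "Ps",
                 ifelse(grepl("^2", step, fixed = TRUE), "Sq", ""))

labels <- paste0(name, suffix)
print(labels)
```

Because `grepl` and `ifelse` work on whole vectors, no `sapply` or `as.character` conversion is needed here.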