Regex lookbehind limit in R

Time: 2019-03-18 03:53:20

Tags: r regex regex-lookarounds

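I am trying to extract the numeric values next to the keyword "High" in the text below (the items in bold), but I get this error:

Error in stri_extract_first_regex(string, pattern, opts_regex = opts(pattern)) : Look-Behind pattern matches must have a bounded maximum length. (U_REGEX_LOOK_BEHIND_LIMIT)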

The regex I am using is

"(?<=High\\s*>?=?\\s?)[\\d\\.]+[\\s\\-\\d\\.]+(?=\\s)"

This works in an online regex tester, but when I run the same thing in RStudio I get the error above.

The text is

 Optimal             <2.6  Desirable           2.6 - 3.3  Borderline high     3.4 - 4.0  High                ***4.1 - 4.8***  Very high           >=4.9

 Desirable       <5.2  Borderline high 5.2 - 6.1  High            >= ***6.2***

 Desirable   <1.7  Borderline High 1.7 - 2.2  High      ***2.3 - 4.4***  Very high >=4.5

Note that I am using double backslashes in R, but only a single backslash may be displayed here.

Can you help me?

1 Answer:

Answer 0: (score: 0)

Sample data

I changed one "Borderline High" to "Borderline high". It was probably a typo.

v <- c("Optimal             <2.6  Desirable           2.6 - 3.3  Borderline high     3.4 - 4.0  High                4.1 - 4.8  Very high           >=4.9",
       "Desirable       <5.2  Borderline high 5.2 - 6.1  High            >= 6.2",
       "Desirable   <1.7  Borderline high 1.7 - 2.2  High      2.3 - 4.4  Very high >=4.5")

Code

library(dplyr)
library(stringr)
data.frame( text = v, stringsAsFactors = FALSE ) %>%
  # Extract the text between "High" and "Very", then trim whitespace
  dplyr::mutate( High = trimws( stringr::str_extract(text, "(?<=High).*(?=Very)") ) ) %>%
  # If nothing was extracted, take everything after "High" up to the end of the string
  dplyr::mutate( High = ifelse( is.na( High ), trimws( stringr::str_extract(text, "(?<=High).*(?=$)") ), High ) ) %>%
  dplyr::select( High )

Output

#        High
# 1 4.1 - 4.8
# 2    >= 6.2
# 3 2.3 - 4.4
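
As for why the pattern in the question fails: stringr delegates to stringi, which uses the ICU regex engine, and ICU only accepts a lookbehind whose maximum length is bounded. The \\s* inside (?<=...) has no upper bound, hence U_REGEX_LOOK_BEHIND_LIMIT. A minimal sketch of one workaround is to cap the whitespace with a bounded quantifier. The limit of 30 and the name pattern_bounded are assumptions chosen for this sample, and the number part of the pattern is simplified from the one in the question.

```r
library(stringr)

# Sample data with "Borderline high" in lower case, as in this answer
v <- c("Optimal             <2.6  Desirable           2.6 - 3.3  Borderline high     3.4 - 4.0  High                4.1 - 4.8  Very high           >=4.9",
       "Desirable       <5.2  Borderline high 5.2 - 6.1  High            >= 6.2",
       "Desirable   <1.7  Borderline high 1.7 - 2.2  High      2.3 - 4.4  Very high >=4.5")

# \\s{0,30} replaces \\s*, so the lookbehind has a bounded maximum length.
# The bound of 30 is an assumption; it only needs to exceed the widest gap.
pattern_bounded <- "(?<=High\\s{0,30}>?=?\\s?)\\d+(?:\\.\\d+)?(?:\\s*-\\s*\\d+(?:\\.\\d+)?)?"

# A number, optionally followed by " - number", right after "High"
high_values <- stringr::str_extract(v, pattern_bounded)
print(high_values)
```

On this sample the result should be the bolded values only ("4.1 - 4.8", "6.2", "2.3 - 4.4"), without the >= comparator.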

Update

Only take the value after "High" if that "High" is not preceded by a letter ([a-zA-Z]) and a space, so that "Borderline High" is skipped.
data.frame( text = v, stringsAsFactors = FALSE ) %>%
  # Extract the text between a standalone "High" and "Very", then trim whitespace
  dplyr::mutate( High = trimws( stringr::str_extract(text, "(?<=[^a-zA-Z] High).*(?=Very)") ) ) %>%
  # If nothing was extracted, take everything after "High" up to the end of the string
  dplyr::mutate( High = ifelse( is.na( High ), trimws( stringr::str_extract(text, "(?<=[^a-zA-Z] High).*(?=$)") ), High ) ) %>%
  dplyr::select( High )
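
An alternative sketch that sidesteps the lookbehind length limit altogether: match "High" and the gap as ordinary pattern text, and pull the value out with a capture group via stringr::str_match. Only a short, fixed-length negative lookbehind is kept, to skip "Borderline High". The pattern and the names txt, pattern_capture, m and high are illustrative and not part of the original answer.

```r
library(stringr)

# Original text from the question, including the capitalised "Borderline High"
txt <- c("Optimal             <2.6  Desirable           2.6 - 3.3  Borderline high     3.4 - 4.0  High                4.1 - 4.8  Very high           >=4.9",
         "Desirable       <5.2  Borderline high 5.2 - 6.1  High            >= 6.2",
         "Desirable   <1.7  Borderline High 1.7 - 2.2  High      2.3 - 4.4  Very high >=4.5")

# (?<![A-Za-z] )  "High" must not follow "<letter><space>" (fixed length, so ICU accepts it)
# \\s*            unbounded whitespace is fine outside a lookbehind
# ( ... )         capture group: optional comparator, a number, optional " - number"
pattern_capture <- "(?<![A-Za-z] )High\\s*((?:[<>]=?\\s*)?\\d+(?:\\.\\d+)?(?:\\s*-\\s*\\d+(?:\\.\\d+)?)?)"

m <- stringr::str_match(txt, pattern_capture)
high <- m[, 2]  # column 1 is the full match, column 2 is the first capture group
print(high)
```

Here the comparator is kept, so the second element should come back as ">= 6.2", matching the output format of the answer above; dropping the `(?:[<>]=?\\s*)?` part from inside the group and placing it before the group would return the number alone.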