
Question

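I am trying to extract the numeric values next to the "High" keyword in the text below (the items in bold), but I get this error:

"Error in stri_extract_first_regex(string, pattern, opts_regex = opts(pattern)) : Look-Behind pattern matches must have a bounded maximum length. (U_REGEX_LOOK_BEHIND_LIMIT)"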

The regex I am using is

"(?<=High\\s*>?=?\\s?)[\\d\\.]+[\\s\\-\\d\\.]+(?=\\s)"

This works in an online regex tester, but when I run the same pattern in RStudio I get the error above.

The text is

 Optimal             <2.6  Desirable           2.6 - 3.3  Borderline high     3.4 - 4.0  High                ***4.1 - 4.8***  Very high           >=4.9

 Desirable       <5.2  Borderline high 5.2 - 6.1  High            >= ***6.2***

 Desirable   <1.7  Borderline High 1.7 - 2.2  High      ***2.3 - 4.4***  Very high >=4.5

Please note that I am using double backslashes in R, even if the post displays only a single one.

Can you help me?

Answer 1

Sample data

I changed one "Borderline High" to "Borderline high". It was probably a typo.

v <- c("Optimal             <2.6  Desirable           2.6 - 3.3  Borderline high     3.4 - 4.0  High                4.1 - 4.8  Very high           >=4.9",
       "Desirable       <5.2  Borderline high 5.2 - 6.1  High            >= 6.2",
       "Desirable   <1.7  Borderline high 1.7 - 2.2  High      2.3 - 4.4  Very high >=4.5")

Code

library(dplyr)
library(stringr)
data.frame( text = v, stringsAsFactors = FALSE ) %>%
  #Extract text between "High" and "Very", trim whitespace
  dplyr::mutate( High = trimws( stringr::str_extract(text, "(?<=High).*(?=Very)") ) ) %>%
  #If no text was extracted, take everything after "High" until the end
  dplyr::mutate( High = ifelse( is.na( High ), trimws( stringr::str_extract(text, "(?<=High).*(?=$)") ), High ) ) %>%
  dplyr::select( High )

Output

#        High
# 1 4.1 - 4.8
# 2    >= 6.2
# 3 2.3 - 4.4
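
For what it's worth, the error in the question comes from the ICU regex engine that stringr uses: it rejects a lookbehind that can match an unbounded number of characters, and \\s* is unbounded. A minimal sketch, assuming no more than 30 whitespace characters ever follow "High" (the cap of 30 and the number pattern are my own choices, not taken from the question), swaps \\s* for a bounded quantifier:

```r
library(stringr)

# Same sample data as above (with "Borderline high" in the third string)
v <- c("Optimal             <2.6  Desirable           2.6 - 3.3  Borderline high     3.4 - 4.0  High                4.1 - 4.8  Very high           >=4.9",
       "Desirable       <5.2  Borderline high 5.2 - 6.1  High            >= 6.2",
       "Desirable   <1.7  Borderline high 1.7 - 2.2  High      2.3 - 4.4  Very high >=4.5")

# \\s{0,30} has a known maximum length, so the lookbehind is bounded.
# The cap of 30 is an assumption about the data; raise it if needed.
# After the lookbehind: one number, optionally followed by "- number".
pattern <- "(?<=High\\s{0,30}>?=?\\s?)\\d+(?:\\.\\d+)?(?:\\s*-\\s*\\d+(?:\\.\\d+)?)?"

high <- str_extract(v, pattern)
print(high)
# [1] "4.1 - 4.8" "6.2"       "2.3 - 4.4"
```

Unlike the dplyr version above, this keeps only the numbers and drops the ">=" prefix, which matches the bold items in the question.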

Update

Take the value after "High" only if that "High" is not preceded by a letter ([a-zA-Z]) and a space, so that "Borderline High" is skipped.

data.frame( text = v, stringsAsFactors = FALSE ) %>%
  #Extract text between "High" and "Very", trim whitespace
  dplyr::mutate( High = trimws( stringr::str_extract(text, "(?<=[^a-zA-Z] High).*(?=Very)") ) ) %>%
  #If no text was extracted, take everything after "High" until the end
  dplyr::mutate( High = ifelse( is.na( High ), trimws( stringr::str_extract(text, "(?<=[^a-zA-Z] High).*(?=$)") ), High ) ) %>%
  dplyr::select( High )
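
Another option is to avoid the lookbehind entirely and use a capture group, which has no length restriction. This is only a sketch: the pattern and the name high_captured are mine, and it assumes the value is either a single number or a "number - number" range. It uses the text exactly as posted in the question, including "Borderline High".

```r
library(stringr)

# Text as posted in the question; note "Borderline High" in the third string
v <- c("Optimal             <2.6  Desirable           2.6 - 3.3  Borderline high     3.4 - 4.0  High                4.1 - 4.8  Very high           >=4.9",
       "Desirable       <5.2  Borderline high 5.2 - 6.1  High            >= 6.2",
       "Desirable   <1.7  Borderline High 1.7 - 2.2  High      2.3 - 4.4  Very high >=4.5")

# (?:^|[^A-Za-z] )  "High" must start the string or follow a non-letter plus a space
# High\\s*          any amount of whitespace is fine outside a lookbehind
# (?:>=?)?\\s*      optional ">" or ">=" prefix, not captured
# ( ... )           group 1: a number, optionally followed by "- number"
pattern <- "(?:^|[^A-Za-z] )High\\s*(?:>=?)?\\s*(\\d+(?:\\.\\d+)?(?:\\s*-\\s*\\d+(?:\\.\\d+)?)?)"

# str_match() returns a matrix: column 1 is the full match, column 2 is group 1
high_captured <- str_match(v, pattern)[, 2]
print(high_captured)
# [1] "4.1 - 4.8" "6.2"       "2.3 - 4.4"
```

Because "Borderline High" is preceded by a letter and a space, the first alternative fails there and the match moves on to the standalone "High".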

Regex lookbehind limit in R
