What regex will find the closest number before or after a word?

Time: 2019-01-28 08:45:08

Tags: r regex

My sentences are as follows:

"There is a 10cm length of Barrett's"
"The length of Barrett's is around 5 cm"
"The Barrett's measures 10cm in length above a 4cm hiatus hernia"
"The length of Barrett's is 5cm but the length of the dysplasia is 3cm"

I want to extract the length of the Barrett's using an ifelse statement:

    ifelse(grepl("(\\.|^)(?=[^\\.]*cm)(?=[^\\.]*Barr)(?=[^\\.]*(of |length))[^\\.]*(\\.|$)",
                 dataframe[,EndoReportColumn], perl=TRUE, ignore.case = TRUE),
           stringr::str_extract(stringr::str_match(dataframe[,EndoReportColumn],
                                                   "(\\.|^)(?=[^\\.]*cm)(?=[^\\.]*[Bb]arr)[^\\.]*(\\.|$)"),
                                "\\d"),
           "None Found")

My problem is that if a sentence contains two numbers, the wrong number is extracted, so the result I get is:

10
5
4
3

How do I get the number closest to Barrett's (either before or after it) in sentences that contain both "length" and "Barrett's"?

2 answers:

Answer 0: (score: 1)

Try this regex:

(\d+\s*\w+)[^\d\r\n]*Barret|[^\d\r\n]*Barret[^\d\r\n]*(\d+\s*\w+)


With a little programming, you can extract the contents of Group 1 or Group 2, whichever one matched.

Note: this solution was designed around the sample strings provided. Also, in R, escape each \ with another \:

(\\d+\\s*\\w+)[^\\d\\r\\n]*Barret|[^\\d\\r\\n]*Barret[^\\d\\r\\n]*(\\d+\\s*\\w+)
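To illustrate the escaping note: inside an ordinary R string literal a backslash must itself be escaped, so the doubled form is what you type, while the regex engine still receives single backslashes. A small sketch follows; the raw-string syntax assumes R 4.0.0 or later, and the variable names are only illustrative.

```r
# Doubled backslashes: required in a regular R string literal.
pattern <- "(\\d+\\s*\\w+)[^\\d\\r\\n]*Barret|[^\\d\\r\\n]*Barret[^\\d\\r\\n]*(\\d+\\s*\\w+)"

# cat() shows the string as the regex engine receives it (single backslashes).
cat(pattern, "\n")

# Raw strings (R >= 4.0.0) let you paste the regex without doubling anything.
pattern_raw <- r"((\d+\s*\w+)[^\d\r\n]*Barret|[^\d\r\n]*Barret[^\d\r\n]*(\d+\s*\w+))"

# Both spellings describe the same pattern.
same_pattern <- identical(pattern, pattern_raw)
print(same_pattern)
```

Because the character classes contain \d, the pattern is best used with perl = TRUE in base R functions such as grepl() or regexec().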

Explanation:

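  • (\d+\s*\w+) - matches 1+ digits, followed by 0+ whitespace characters, followed by 1+ word characters, capturing the length and its unit in Group 1
  • [^\d\r\n]*Barret - matches 0+ occurrences of any character that is not a digit, a newline or a carriage return, followed by the word Barret
  • | - or
  • [^\d\r\n]*Barret[^\d\r\n]* - matches 0+ occurrences of any character that is not a digit, a newline or a carriage return, followed by the word Barret, followed by another 0+ occurrences of any such character
  • (\d+\s*\w+) - matches 1+ digits, followed by 0+ whitespace characters, followed by 1+ word characters, capturing the length and its unit in Group 2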
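As a rough sketch of how Group 1 or Group 2 might be pulled out in R, the snippet below applies the escaped pattern with base R's regexec() (with perl = TRUE) and keeps whichever group took part in the match. The helper pick_group and the variable names are hypothetical, introduced only for this illustration.

```r
sentences <- c(
  "There is a 10cm length of Barrett's",
  "The length of Barrett's is around 5 cm",
  "The Barrett's measures 10cm in length above a 4cm hiatus hernia",
  "The length of Barrett's is 5cm but the length of the dysplasia is 3cm"
)

# The pattern from this answer, with every backslash doubled for an R string.
pattern <- "(\\d+\\s*\\w+)[^\\d\\r\\n]*Barret|[^\\d\\r\\n]*Barret[^\\d\\r\\n]*(\\d+\\s*\\w+)"

# For each sentence: the full match followed by one entry per capture group.
groups <- regmatches(sentences, regexec(pattern, sentences, perl = TRUE))

# Hypothetical helper: keep whichever capture group actually matched.
pick_group <- function(g) {
  hit <- g[-1]                           # drop the full match
  hit <- hit[!is.na(hit) & nzchar(hit)]  # drop the group that did not participate
  if (length(hit) == 0) NA_character_ else hit[1]
}

length_with_unit <- vapply(groups, pick_group, character(1), USE.NAMES = FALSE)

# The leading digits of the captured text give the numeric length.
length_value <- as.numeric(sub("^([0-9]+).*$", "\\1", length_with_unit))

print(data.frame(sentence = sentences, length_with_unit, length_value))
```

For the four sample sentences this should pick 10, 5, 10 and 5. Note that the second alternative takes the first number after Barret, so a sentence with numbers on both sides of the word would need extra handling.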

Answer 1: (score: 1)

This may not be the best/shortest/fastest answer, but it gives the desired result and can easily be extended as the data become more complex.

Sample data

vec <- c( "There is a 10cm length of Barrett's",
"The length of Barrett's is around 5 cm",
"The Barrett's measures 10cm in length above a 4cm hiatus hernia",
"The length of Barrett's is 5cm but the length of the dysplasia is 3cm")

Code

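library( tidyverse )
# data.table must also be installed; it is called below via data.table::

# one data.frame per sentence: every number and the position where it starts
l <- lapply( vec, function(x) {
  data.frame( value = as.numeric( unlist( str_extract_all( x, "[0-9]+" ) ) ),
              position = as.numeric( unlist( gregexpr( "[0-9]+", x) ) ) )
  })
# stack them; idcol = "id" records which sentence each row came from
matches <- as.data.frame( data.table::rbindlist(l, idcol = "id" ) )

df <- data.frame( text = vec, stringsAsFactors = FALSE )
pattern_ <-"Barrett's"

df %>%
  # locate the start and end of the keyword in each sentence
  mutate( id = row_number(),
          start_barrett = regexpr( pattern_, text),
          end_barrett = start_barrett + nchar( pattern_ ) ) %>%
  # attach every number found in the same sentence
  left_join( matches, by = "id" ) %>%
  # character distance from the keyword, measured on whichever side the number lies
  mutate( distance = ifelse( position > start_barrett, position - end_barrett, start_barrett - position ) ) %>%
  # keep only the nearest number per sentence
  group_by( id ) %>%
  arrange( distance ) %>%
  slice( 1L ) %>%
  ungroup() %>%
  select( text, value )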

Output

# # A tibble: 4 x 2
#   text                                                                  value
#   <chr>                                                                 <dbl>
# 1 There is a 10cm length of Barrett's                                      10
# 2 The length of Barrett's is around 5 cm                                    5
# 3 The Barrett's measures 10cm in length above a 4cm hiatus hernia          10
# 4 The length of Barrett's is 5cm but the length of the dysplasia is 3cm     5
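
The same nearest-number idea can also be sketched in base R without extra packages. This is only a minimal illustration, not the answer's own code: the function name closest_number and its arguments are hypothetical, and it assumes that a sentence with no keyword or no number should yield NA.

```r
vec <- c("There is a 10cm length of Barrett's",
         "The length of Barrett's is around 5 cm",
         "The Barrett's measures 10cm in length above a 4cm hiatus hernia",
         "The length of Barrett's is 5cm but the length of the dysplasia is 3cm")

# Hypothetical helper: the number nearest (in characters) to a keyword.
closest_number <- function(text, keyword = "Barrett's") {
  vapply(text, function(x) {
    kw_start <- as.integer(regexpr(keyword, x, fixed = TRUE))
    num_match <- gregexpr("[0-9]+", x)[[1]]
    num_start <- as.integer(num_match)
    # Assumption: no keyword or no number in the sentence gives NA.
    if (kw_start == -1L || num_start[1] == -1L) return(NA_real_)
    kw_end  <- kw_start + nchar(keyword) - 1L
    num_end <- num_start + attr(num_match, "match.length") - 1L
    # Gap between keyword and number, measured on whichever side the number lies.
    gap <- ifelse(num_start > kw_end, num_start - kw_end, kw_start - num_end)
    values <- as.numeric(regmatches(x, num_match))
    values[which.min(gap)]
  }, numeric(1), USE.NAMES = FALSE)
}

nearest <- closest_number(vec)
print(nearest)
```

Unlike the pipeline above, this version measures the gap from the end of a number that precedes the keyword, and it does not fail on sentences that contain no digits.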