My sentences are as follows:
"There is a 10cm length of Barrett's"
"The length of Barrett's is around 5 cm"
"The Barrett's measures 10cm in length above a 4cm hiatus hernia"
"The length of Barrett's is 5cm but the length of the dysplasia is 3cm"
I want to extract the length of the Barrett's segment using an ifelse statement:
ifelse(grepl("(\\.|^)(?=[^\\.]*cm)(?=[^\\.]*Barr)(?=[^\\.]*(of |length))[^\\.]*(\\.|$)",
dataframe[,EndoReportColumn], perl=TRUE,ignore.case = TRUE),
stringr::str_extract(stringr::str_match(dataframe[,EndoReportColumn],"(\\.|^)(?=[^\\.]*cm)(?=[^\\.]*[Bb]arr)[^\\.]*(\\.|$)"),"\\d"),"None Found")
My problem is that when a sentence contains two numbers, the wrong number is extracted, so the results I get are:
10
5
4
3
How can I get the number closest to "Barrett's" (either before or after it) in a sentence that mentions both a length and Barrett's?
Answer 0 (score: 1)
Try this regex:
(\d+\s*\w+)[^\d\r\n]*Barret|[^\d\r\n]*Barret[^\d\r\n]*(\d+\s*\w+)
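With a little programming you can extract the contents of Group 1 or Group 2.
Note:
This solution was designed around the sample strings provided. Also, inside an R string literal, escape each \ with another \: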
(\\d+\\s*\\w+)[^\\d\\r\\n]*Barret|[^\\d\\r\\n]*Barret[^\\d\\r\\n]*(\\d+\\s*\\w+)
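The "little programming" needed to read Group 1 or Group 2 could look like the sketch below. It assumes the stringr package is installed and reuses the four sample sentences from the question; the variable names (vec, pattern, m, length_with_unit, barrett_cm) are illustrative and not part of the original answer.

```r
library(stringr)

vec <- c("There is a 10cm length of Barrett's",
         "The length of Barrett's is around 5 cm",
         "The Barrett's measures 10cm in length above a 4cm hiatus hernia",
         "The length of Barrett's is 5cm but the length of the dysplasia is 3cm")

# The regex from the answer, with every backslash doubled for an R string
pattern <- "(\\d+\\s*\\w+)[^\\d\\r\\n]*Barret|[^\\d\\r\\n]*Barret[^\\d\\r\\n]*(\\d+\\s*\\w+)"

# str_match() returns a character matrix: full match, Group 1, Group 2.
# A group that did not take part in the match is NA.
m <- str_match(vec, pattern)

# Use Group 1 (number before "Barret") when present, otherwise Group 2
length_with_unit <- ifelse(is.na(m[, 2]), m[, 3], m[, 2])

# Keep only the digits of the captured "number + unit" text
barrett_cm <- as.numeric(str_extract(length_with_unit, "\\d+"))
```

Sentences in which neither alternative matches produce NA here, which can be replaced with a label such as "None Found" if that is what the calling code expects.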
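Explanation:
(\d+\s*\w+) - matches 1+ digits, followed by 0+ whitespace characters, followed by 1+ word characters, so that the length and its unit are matched and captured in Group 1
[^\d\r\n]*Barret - matches 0+ occurrences of any character that is not a digit, a newline, or a carriage return, followed by the word Barret
| - OR
[^\d\r\n]*Barret[^\d\r\n]* - matches 0+ occurrences of any character that is not a digit, a newline, or a carriage return, followed by the word Barret, followed by another 0+ occurrences of such characters
(\d+\s*\w+) - matches 1+ digits, followed by 0+ whitespace characters, followed by 1+ word characters, so that the length and its unit are matched and captured in Group 2
Answer 1 (score: 1)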
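This may not be the best, shortest, or fastest answer, but it produces the desired result and can easily be extended as the data becomes more complex.
Sample data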
vec <- c( "There is a 10cm length of Barrett's",
"The length of Barrett's is around 5 cm",
"The Barrett's measures 10cm in length above a 4cm hiatus hernia",
"The length of Barrett's is 5cm but the length of the dysplasia is 3cm")
Code
library( tidyverse )

# For every sentence, record each number and the position where it starts
l <- lapply( vec, function(x) {
  data.frame( value    = as.numeric( unlist( str_extract_all( x, "[0-9]+" ) ) ),
              position = as.numeric( unlist( gregexpr( "[0-9]+", x ) ) ) )
})
# Stack the per-sentence results; id is the index of the sentence in vec
matches <- as.data.frame( data.table::rbindlist( l, idcol = "id" ) )

df <- data.frame( text = vec, stringsAsFactors = FALSE )
pattern_ <- "Barrett's"

df %>%
  mutate( id = row_number(),
          start_barrett = regexpr( pattern_, text ),
          end_barrett = start_barrett + nchar( pattern_ ) ) %>%
  left_join( matches, by = "id" ) %>%
  # Distance from the start of the number to the end of "Barrett's" when the
  # number comes after it, or to the start of "Barrett's" when it comes before
  mutate( distance = ifelse( position > start_barrett, position - end_barrett, start_barrett - position ) ) %>%
  # Keep only the closest number in each sentence
  group_by( id ) %>%
  arrange( distance ) %>%
  slice( 1L ) %>%
  ungroup() %>%
  select( text, value )
Output
# # A tibble: 4 x 2
# text value
# <chr> <dbl>
# 1 There is a 10cm length of Barrett's 10
# 2 The length of Barrett's is around 5 cm 5
# 3 The Barrett's measures 10cm in length above a 4cm hiatus hernia 10
# 4 The length of Barrett's is 5cm but the length of the dysplasia is 3cm 5
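As one way this position-based idea might be extended to messier data, here is a minimal base R sketch that wraps it in a function and returns NA when a sentence has no number or does not mention the keyword. The name nearest_number() and its keyword argument are hypothetical and not part of the original answer, and the gap is measured slightly differently (from the nearest edges of the number and the keyword).

```r
# Hypothetical helper (not from the original answer): for each sentence,
# return the number closest to `keyword`, or NA if either one is missing.
nearest_number <- function(text, keyword = "Barrett's") {
  vapply(text, function(x) {
    kw  <- regexpr(keyword, x, fixed = TRUE)   # first occurrence of keyword
    g   <- gregexpr("[0-9]+", x)               # every run of digits
    num <- g[[1]]
    if (kw[1] == -1 || num[1] == -1) return(NA_real_)

    kw_start  <- as.integer(kw)
    kw_end    <- kw_start + attr(kw, "match.length") - 1
    num_start <- as.integer(num)
    num_end   <- num_start + attr(num, "match.length") - 1

    # Characters between the nearest edges of the number and the keyword
    gap <- ifelse(num_start > kw_end, num_start - kw_end, kw_start - num_end)

    values <- as.numeric(regmatches(x, g)[[1]])
    values[which.min(gap)]
  }, numeric(1), USE.NAMES = FALSE)
}

vec <- c("There is a 10cm length of Barrett's",
         "The length of Barrett's is around 5 cm",
         "The Barrett's measures 10cm in length above a 4cm hiatus hernia",
         "The length of Barrett's is 5cm but the length of the dysplasia is 3cm")

result <- nearest_number(vec)
```

Because the function is vectorised over its input, it could be applied directly to a report column inside ifelse() or mutate(), assuming one sentence per element.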