R: Extracting bigrams with a zero-width lookahead

Time: 2019-01-22 18:35:02

Tags: r regex stringr lookahead

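I want to extract bigrams from sentences using the regular expression described here, and store the output in a new column alongside the original column.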


library(dplyr)
library(stringr)
library(splitstackshape)

df  <- data.frame(a =c("apple orange plum"))

# Single Words - Successful
df %>%
  # Base R
  mutate(b =  sapply(regmatches(a,gregexpr("\\w+\\b", a, perl = TRUE)),
                     paste, collapse=";")) %>%
  # Duplicate with Stringr
  mutate(c =  sapply(str_extract_all(a,"\\w+\\b"),paste, collapse=";")) %>%
  cSplit(., c(2,3), sep = ";", direction = "long")

At first I thought the problem was with the regex engine, but neither stringr::str_extract_all (ICU) nor base::regmatches (PCRE) works.

# Bigrams - Fails
df %>%
  # Base R
  mutate(b =  sapply(regmatches(a,gregexpr("(?=(\\b\\w+\\s+\\w+))", a, perl = TRUE)),
                     paste, collapse=";")) %>%
  # Duplicate with Stringr
  mutate(c =  sapply(str_extract_all(a,"(?=(\\b\\w+\\s+\\w+))"),paste, collapse=";")) %>%
  cSplit(., c(2,3), sep = ";", direction = "long")

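So my guess is that the problem has to do with wrapping the capture group in a zero-width lookahead. Is there a valid regular expression in R that can extract these bigrams?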

1 Answer:

Answer 0: (score: 1)

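As @WiktorStribiżew suggested, using str_match_all helps here: unlike str_extract_all, it also returns the contents of capture groups. Here is how to apply it to multiple rows of a data frame. Let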

(df <- data.frame(a = c("one two three", "four five six")))
#               a
# 1 one two three
# 2 four five six

Then we can do

df %>% rowwise() %>% 
  do(data.frame(., b = str_match_all(.$a, "(?=(\\b\\w+\\s+\\w+))")[[1]][, 2], stringsAsFactors = FALSE))
# Source: local data frame [4 x 2]
# Groups: <by row>
#
# A tibble: 4 x 2
#   a             b        
# * <fct>         <chr>    
# 1 one two three one two  
# 2 one two three two three
# 3 four five six four five
# 4 four five six five six

where stringsAsFactors = FALSE is there only to avoid a warning when the rows are bound together.
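For reference, the reason the original attempts return empty strings is that a lookahead consumes no characters, so the overall match at each position is zero-width. The bigram text lives only in capture group 1, which str_extract_all and regmatches do not return. The following is a minimal sketch of a vectorized variant that avoids rowwise() and do(); the names x, m, bigrams and out are illustrative and not part of the original answer.

```r
library(stringr)

# Example input, same as in the answer above
x <- c("one two three", "four five six")

# str_match_all returns one character matrix per input string:
# column 1 is the full (zero-width, hence empty) match,
# column 2 is capture group 1, which holds the bigram
m <- str_match_all(x, "(?=(\\b\\w+\\s+\\w+))")

# Keep only the capture-group column for each input string
bigrams <- lapply(m, function(mat) mat[, 2])

# Repeat each original string once per bigram found in it
out <- data.frame(
  a = rep(x, lengths(bigrams)),
  b = unlist(bigrams),
  stringsAsFactors = FALSE
)
out
```

A string with fewer than two words yields a zero-row matrix, so it contributes no rows to out rather than an NA.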