Extracting a sample of words around a specific word using stringr in R

Time: 2015-12-21 19:54:11

Tags: regex r stringr

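Several similar questions on this topic have been posted on SO, but they seem to be poorly worded (example) or in a different language (example).

In my scenario, I consider anything surrounded by white space to be a word. Emoticons, numbers, and strings of letters that aren't really words all count; I don't care. I just want some context around the string that was found, without having to read the entire file to work out whether it's a valid match.

I tried the following, but it takes a while to run on a long text file: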

text <- "He served both as Attorney General and Lord Chancellor of England. After his death, he remained extremely influential through his works, especially as philosophical advocate and practitioner of the scientific method during the scientific revolution. Bacon has been called the father of empiricism.[6] His works argued for the possibility of scientific knowledge based only upon inductive and careful observation of events in nature. Most importantly, he argued this could be achieved by use of a skeptical and methodical approach whereby scientists aim to avoid misleading themselves. While his own practical ideas about such a method, the Baconian method, did not have a long lasting influence, the general idea of the importance and possibility of a skeptical methodology makes Bacon the father of scientific method. This marked a new turn in the rhetorical and theoretical framework for science, the practical details of which are still central in debates about science and methodology today. Bacon was knighted in 1603 and created Baron Verulam in 1618[4] and Viscount St. Alban in 1621;[3][b] as he died without heirs, both titles became extinct upon his death. Bacon died of pneumonia in 1626, with one account by John Aubrey stating he contracted the condition while studying the effects of freezing on the preservation of meat."

stringr::str_extract(text, "(.*?\\s){1,10}Verulam(\\s.*?){1,10}")

I assume there are much faster/more efficient ways to do this, yes?

3 answers:

Answer 0: (score: 5)

Try this:

stringr::str_extract(text, "([^\\s]+\\s){3}Verulam(\\s[^\\s]+){3}")
# alternately, if you like " " more than \\s:
# stringr::str_extract(text, "(?:[^ ]+ ){3}Verulam(?: [^ ]+){3}")

#[1] "and created Baron Verulam in 1618[4] and"

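Change the numbers inside the {} to control how many words are returned on each side.

You can also use non-capturing (?:) groups, though I'm not sure yet whether that improves speed.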
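If the counts need to change often, the pattern can be built from parameters. The following is only a sketch: `context_pattern` and `sample_text` are hypothetical names, not part of stringr or of the answer above. It assumes the search word should be matched literally (hence `\Q...\E`) and as a whole whitespace-delimited token (hence the lookarounds), and it uses `{0,n}` rather than `{n}` so that a word near the start or end of the text still matches with fewer neighbours.

```r
library(stringr)

# Hypothetical helper: build a pattern that matches `word` together with
# up to `pre` words before it and up to `post` words after it.
context_pattern <- function(word, pre = 3, post = pre) {
  # \\Q...\\E treats the word literally; (?<!\\S) and (?!\\S) require
  # whitespace (or the text boundary) on both sides of the word.
  sprintf("(?:\\S+\\s+){0,%d}(?<!\\S)\\Q%s\\E(?!\\S)(?:\\s+\\S+){0,%d}",
          pre, word, post)
}

# Shortened sample taken from the question's text
sample_text <- "Bacon was knighted in 1603 and created Baron Verulam in 1618[4] and Viscount St. Alban in 1621"

str_extract(sample_text, context_pattern("Verulam"))
# [1] "and created Baron Verulam in 1618[4] and"

# A word at the very start still matches, just without left context
str_extract(sample_text, context_pattern("Bacon"))
# [1] "Bacon was knighted in"
```

Swapping `str_extract` for `str_extract_all` should return every occurrence rather than only the first.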

stringr::str_extract(text, "(?:[^\\s]+\\s){3}Verulam(?:\\s[^\\s]+){3}")
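One way to check the speed question is to time both patterns on a longer input. This is a rough sketch, not a proper benchmark: `long_text` is an assumed stand-in for a long file (a phrase from the question repeated 2000 times), and the timings will differ between machines and stringr versions, so no numbers are shown.

```r
library(stringr)

# Assumed stand-in for a long text file
long_text <- paste(rep("and created Baron Verulam in 1618[4] and Viscount", 2000),
                   collapse = " ")

capturing     <- "([^\\s]+\\s){3}Verulam(\\s[^\\s]+){3}"
non_capturing <- "(?:[^\\s]+\\s){3}Verulam(?:\\s[^\\s]+){3}"

# Both patterns should find exactly the same matches
res_cap <- str_extract_all(long_text, capturing)[[1]]
res_non <- str_extract_all(long_text, non_capturing)[[1]]
identical(res_cap, res_non)

# Elapsed time over repeated runs; compare the two on your own data
system.time(for (i in 1:20) str_extract_all(long_text, capturing))
system.time(for (i in 1:20) str_extract_all(long_text, non_capturing))
```

For more careful measurements, a dedicated benchmarking package would be a better fit than `system.time`.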

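Answer 1: (score: 3)

I'd use unlist(strsplit) and then index the resulting vector. You can wrap it in a function so that the number of words to return before and after the match (pre and post) is a flexible parameter: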

getContext <- function(text, look_for, pre = 3, post=pre) {
  # create vector of words (anything separated by a space)
  t_vec <- unlist(strsplit(text, '\\s'))

  # find position of matches
  matches <- which(t_vec==look_for)

  # return words before & after if any matches
  if(length(matches) > 0) {
    out <- 
      list(before = sapply(matches, function(m) {
                      # too few preceding words: return NAs instead of bad indices
                      if (m - pre < 1) rep(NA_character_, pre)
                      else t_vec[(m - pre):(m - 1)]
                    }), 
           after = sapply(matches, function(m) t_vec[(m + 1):(m + post)]))

    return(out)
  } else {
    warning('No matches')
  }
}

Works for a single match:

getContext(text, 'Verulam')

# $before
#      [,1]     
# [1,] "and"    
# [2,] "created"
# [3,] "Baron"  
# 
# $after
#      [,1]     
# [1,] "in"     
# [2,] "1618[4]"
# [3,] "and"   

It also works when there are multiple matches:

getContext(text, 'he')

# $before
#      [,1]     [,2]           [,3]          [,4]     
# [1,] "After"  "nature."      "in"          "John"   
# [2,] "his"    "Most"         "1621;[3][b]" "Aubrey" 
# [3,] "death," "importantly," "as"          "stating"
# 
# $after
#      [,1]          [,2]     [,3]      [,4]        
# [1,] "remained"    "argued" "died"    "contracted"
# [2,] "extremely"   "this"   "without" "the"       
# [3,] "influential" "could"  "heirs,"  "condition" 

getContext(text, 'fruitloops')
# Warning message:
#   In getContext(text, "fruitloops") : No matches
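Matches close to the beginning or end of the text have fewer than pre or post neighbours. A minimal sketch of one way to handle that, using the same split-and-index idea, is to clamp the window to the bounds of the word vector and return one string per match. `get_context_lines` is a hypothetical name, and splitting on runs of whitespace (`\\s+`) is an assumption about how the text should be tokenized.

```r
# Hypothetical variant: return each match with its context as one string,
# clamping the window at the start and end of the text.
get_context_lines <- function(text, look_for, pre = 3, post = pre) {
  # split on runs of whitespace so repeated spaces don't create empty words
  words <- strsplit(text, "\\s+")[[1]]
  hits <- which(words == look_for)

  vapply(hits, function(i) {
    from <- max(1, i - pre)               # don't index before the first word
    to   <- min(length(words), i + post)  # or past the last word
    paste(words[from:to], collapse = " ")
  }, character(1))
}

get_context_lines("a b c d e", "a", pre = 2)
# [1] "a b c"
get_context_lines("a b c d e", "e", pre = 2)
# [1] "c d e"
get_context_lines("a b c d e", "z")
# character(0)
```

Returning a character vector rather than a list of matrices makes the result easy to print or write out line by line, at the cost of no longer separating the before and after words.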

Answer 2: (score: 1)

If you don't mind duplicating your data, you can build a data.frame, which is usually the most convenient structure to work with in R.

context <- function(text){
  splittedText <- strsplit(text, ' ', fixed = TRUE)[[1]]
  print(splittedText)

  data.frame(
    words  = splittedText,
    before = head(c('', splittedText), -1), 
    after  = tail(c(splittedText, ''), -1)
  )
}

Cleaner, IMO:

info <- context(text)

print(subset(info, words == 'Verulam'))

print(subset(info, before == 'Lord'))

print(subset(info, grepl('[[:digit:]]', words)))

#       words before after
# 161 Verulam  Baron    in
#        words before after
# 9 Chancellor   Lord    of
#             words before after
# 43  empiricism.[6]     of   His
# 157           1603     in   and
# 163        1618[4]     in   and
# 169    1621;[3][b]     in    as
# 187          1626,     in  with
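The data.frame above keeps one word of context on each side. As a sketch only, the same idea could be extended to n words per side; `context_n` is a hypothetical name, and joining the neighbouring words into a single string per cell is just one possible layout.

```r
# Hypothetical generalisation of context(): `n` words on each side,
# stored as space-joined strings.
context_n <- function(text, n = 2) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  len <- length(words)

  # Join words[from..to], clamped to the text; "" if the range is empty
  join <- function(from, to) {
    from <- max(from, 1L)
    to   <- min(to, len)
    if (from > to) "" else paste(words[from:to], collapse = " ")
  }

  idx <- seq_len(len)
  data.frame(
    words  = words,
    before = vapply(idx, function(i) join(i - n, i - 1L), character(1)),
    after  = vapply(idx, function(i) join(i + 1L, i + n), character(1)),
    stringsAsFactors = FALSE
  )
}

info_n <- context_n("Bacon was knighted in 1603 and created Baron Verulam", n = 2)
subset(info_n, words == "knighted")
```

Because every row stores its own copy of the surrounding words, memory use grows with n, which is the duplication trade-off mentioned at the top of this answer.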