Question

I am working with the data frame below

        id            Comments
        1             The apple fell far from the mango tree
        2             I was born under a mango tree and a wandering star      
        3             Mules are made for packing and Mangoes for eating

I am interested in extracting the 3 words before and the 3 words after the word mango, together with the word mango itself.

The final dataset would look like this.

        id            Comments
        1             far from the mango tree
        2             born under a mango tree and a      
        3             for packing and Mangoes for eating

Here is a reproducible test dataset

df <- read.table(text="Id,Comment
1,The apple fell far from the mango tree
2,I was born under a mango tree and a wandering star
3,Mules are made for packing and Mangoes for eating", header=TRUE, sep=",")

Any insights on this would be much appreciated.

Answer 1

I used the very nice stringi package and a regular expression:

library(stringi)
apply(df,1, function(myrow){
   stri_match_all_regex(myrow[2], "(\\p{L}+\\p{Z}){0,3}(mango\\p{L}*|Mango\\p{L}*)(\\p{Z}\\p{L}+){0,3}")[[1]][1,1]
   })

So I take from 0 to 3 words before mango ((\\p{L}+\\p{Z}){0,3}), then mango or Mango plus any trailing letters ((mango\\p{L}*|Mango\\p{L}*)), and then from 0 to 3 words after it ((\\p{Z}\\p{L}+){0,3}).

\p{Z} matches a separator character such as a space, and \p{L} matches a letter.
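As a rough sketch of the same idea, the pattern can also be applied to the whole column at once rather than row by row. This assumes that stri_extract_first_regex with a case-insensitive option is acceptable in place of the two spelled-out alternatives; the vector name comments and the variable names below are only illustrative.

```r
library(stringi)

# Illustrative input: the three comments from the question
comments <- c(
  "The apple fell far from the mango tree",
  "I was born under a mango tree and a wandering star",
  "Mules are made for packing and Mangoes for eating"
)

# Up to 3 words before, the mango word itself, up to 3 words after
pattern <- "(\\p{L}+\\p{Z}){0,3}mango\\p{L}*(\\p{Z}\\p{L}+){0,3}"

# Vectorised over the input; returns NA where nothing matches
context <- stri_extract_first_regex(
  comments, pattern,
  opts_regex = stri_opts_regex(case_insensitive = TRUE)
)
context
```

Because the regex engine returns the leftmost match, the result starts at the earliest word from which mango is still reachable within three words.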

Answer 2

This seems to work:

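sapply(
  strsplit(as.character(df$Comment), " "),  # split each comment into words
  function(x){
    # index of the first word containing mango or Mango
    # ([mM] rather than [m|M], which would also accept a literal "|")
    w = grep("[mM]ango", x)[1]
    # keep up to 3 words on either side, clamped to the ends of the sentence
    paste(x[ seq(max(1,w-3), min(length(x),w+3)) ], collapse=" ")
  }
)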
# [1] "far from the mango tree"           
# [2] "born under a mango tree and a"     
# [3] "for packing and Mangoes for eating"

grep(...)[1] means that only the first mango match is used.
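One thing to keep in mind is that grep(...)[1] is NA when a comment contains no mango at all, and seq() then fails. A minimal sketch of a guarded version follows; the helper name words_around and its arguments pattern and n are hypothetical, not part of the original answer.

```r
# Hypothetical helper: up to n words on each side of the first match
words_around <- function(text, pattern = "[mM]ango", n = 3) {
  # Split on runs of whitespace after trimming the ends
  words <- strsplit(trimws(text), "\\s+")[[1]]
  # Index of the first matching word, NA if there is none
  hit <- grep(pattern, words)[1]
  if (is.na(hit)) return(NA_character_)
  # Clamp the window to the bounds of the sentence
  idx <- seq(max(1, hit - n), min(length(words), hit + n))
  paste(words[idx], collapse = " ")
}

comments <- c(
  "The apple fell far from the mango tree",
  "I was born under a mango tree and a wandering star",
  "Mules are made for packing and Mangoes for eating",
  "No fruit in this one"
)

result <- vapply(comments, words_around, character(1), USE.NAMES = FALSE)
result
```

The last comment has no match, so it yields NA instead of an error.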

String split and conditional paste
