Partial matching of text in R

Time: 2017-01-06 16:28:40

Tags: r grep

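I have a dataset containing ids and their corresponding phrases. An id can have phrases of two or three words. Within an id, if there are both two-word and three-word phrases, match each two-word phrase against the three-word phrases. If they match, keep the two-word phrase and delete the three-word phrase.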

 Data:
          id         text
          11    XYX not working
          11    cant find anything
          11    wont let go
          11    wont let open
          11    not working
          11    let open
          12    no music store
          12    no sound store
          12    not playing
          12    not printing
          12    no music
          13    paper issue
          13    charger issue
          14    no issue found

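Example: in id 11, 'let open' matches 'wont let open', so delete 'wont let open' and keep 'let open'. 'not working' matches 'XYX not working', so keep 'not working' and delete 'XYX not working'. Keep the other phrases that do not match. The matching should always be between the two-word and three-word phrases within the same id.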

 Expected output:

          id          text
          11    cant find anything
          11    wont let go
          11    not working
          11    let open
          12    no sound store
          12    not playing
          12    not printing
          12    no music
          13    paper issue
          13    charger issue
          14    no issue found

Thanks in advance!

3 answers:

Answer 0 (score: 2)

Here is a solution using the tidyverse family of packages:

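library(stringr)
library(tidyverse)

# Despite its name, this returns TRUE for the phrases to keep: those that do
# not contain any other phrase of the same id as a substring. Each phrase is
# used as a regex pattern, and word counts are not checked.
is_long_phrase <- function(x) {
  map_lgl(x, ~ !any(str_detect(.x, setdiff(x, .x))))
}

# data: the data frame from the question, with columns id and text
data %>%
  group_by(id) %>% 
  filter(is_long_phrase(text)) %>% 
  ungroup()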
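
The answer above matches by substring, so a phrase such as 'let open' would also match inside a longer word, and it does not look at word counts. As a rough sketch of a stricter variant in base R, the helper below only drops 3-word phrases, and only when a 2-word phrase of the same id appears in them as whole, adjacent words. The name `keep_phrase` is hypothetical (it is not from the answer), and the sketch assumes words are separated by single spaces.

```r
# Hypothetical helper, not part of the answer: returns TRUE for phrases to keep.
keep_phrase <- function(x) {
  n_words <- lengths(strsplit(trimws(x), "\\s+"))
  short <- x[n_words == 2]
  vapply(seq_along(x), function(i) {
    if (n_words[i] != 3) return(TRUE)  # only 3-word phrases can be dropped
    # Pad with spaces so a 2-word phrase only matches on whole words.
    padded <- paste0(" ", x[i], " ")
    hits <- vapply(short,
                   function(s) grepl(paste0(" ", s, " "), padded, fixed = TRUE),
                   logical(1))
    !any(hits)
  }, logical(1))
}

# The data from the question.
df <- data.frame(
  id = c(11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 13, 13, 14),
  text = c("XYX not working", "cant find anything", "wont let go",
           "wont let open", "not working", "let open",
           "no music store", "no sound store", "not playing",
           "not printing", "no music", "paper issue",
           "charger issue", "no issue found"),
  stringsAsFactors = FALSE
)

# Apply the helper per id and put the flags back in the original row order.
keep <- unsplit(lapply(split(df$text, df$id), keep_phrase), df$id)
result <- df[keep, ]
```

On the question's data, `result` should contain the 11 rows listed under the expected output, in the original row order.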

Answer 1 (score: 1)

Try this:

# the data
df <- read.csv(text='id,text
                 11,XYX not working
                 11,cant find anything
                 11,wont let go
                 11,wont let open
                 11,not working
                 11,let open
                 12,no music store
                 12,no sound store
                 12,not playing
                 12,not printing
                 12,no music
                 13,paper issue
                 13,charger issue
                 14,no issue found', header=TRUE, stringsAsFactors=FALSE)

# the code
df$words <- lengths(strsplit(df$text, split='\\s+')) # number of words in each text
df.idlst <- split(df, df$id) # one data frame per id
Vgrepl <- Vectorize(grepl, 'pattern', SIMPLIFY = TRUE) # grepl vectorized over patterns
# flag a 3-word phrase if any 2-word phrase with the same id occurs in it
# (unlist() lines up with the rows of df only because df is sorted by id)
df$del <- unlist(lapply(df.idlst, function(d) sapply(seq_len(nrow(d)), function(i)
  d$words[i] == 3 && any(Vgrepl(d$text[d$words == 2], d$text[i])))))
df[!df$del,][1:2] # df[row,]$del == TRUE => the row has to be deleted

# the output
   id               text
2  11 cant find anything
3  11        wont let go
5  11        not working
6  11           let open
8  12     no sound store
9  12        not playing
10 12       not printing
11 12           no music
12 13        paper issue
13 13      charger issue
14 14     no issue found
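
One caveat with the `grepl` approach: each 2-word phrase is used as a regular expression, so characters such as `.` are not matched literally. The small sketch below illustrates the difference with `fixed = TRUE`; the two phrases are made up for illustration and do not come from the question's data.

```r
short <- "ver 1.0"      # hypothetical 2-word phrase with a regex metacharacter
long  <- "ver 100 bad"  # hypothetical 3-word phrase

regex_hit <- grepl(short, long)                # "." matches any character
fixed_hit <- grepl(short, long, fixed = TRUE)  # literal comparison
```

Here `regex_hit` is TRUE while `fixed_hit` is FALSE, so passing `fixed = TRUE` may be the safer choice when phrases can contain punctuation.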

Answer 2 (score: 0)

One idea is to create a custom function and apply it to the dataset:

library(dplyr)
library(stringi)

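fun1 <- function(x){
  ind <- character(0) # default: nothing to drop (e.g. an id with a single phrase)
  if(length(x) > 1) {
    # all pairs of (3-word phrase, 2-word phrase) within the group
    m1 <- expand.grid(x[stri_count_words(x) == 3], x[stri_count_words(x) == 2])
    # 3-word phrases that share both words with some 2-word phrase
    ind <- unique(m1[apply(m1, 1, function(i)length(Reduce(`intersect`, stri_extract_all_words(i)))) == 2,1])
  }
  return(as.character(ind))
}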

df %>% 
  group_by(id) %>% 
  filter(!text %in% fun1(text))

#Source: local data frame [11 x 2]
#Groups: id [4]

#      id               text
#   <int>              <chr>
#1     11        not working
#2     11           let open
#3     11 cant find anything
#4     11        wont let go
#5     12        not playing
#6     12       not printing
#7     12           no music
#8     12     no sound store
#9     13        paper issue
#10    13      charger issue
#11    14     no issue found
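
Note that this answer compares sets of words rather than substrings, so it would also treat a 3-word phrase as a match when the two words are present but not adjacent. A minimal base R sketch of that difference follows; the phrase 'let it open' and the helper name `words` are made up for illustration.

```r
# Hypothetical helper: split a phrase into its words.
words <- function(s) strsplit(s, "\\s+")[[1]]

long  <- "let it open"  # hypothetical 3-word phrase
short <- "let open"

# Word-set comparison, as in the answer above: both words are shared.
set_match <- length(intersect(words(long), words(short))) == 2

# Substring comparison, as in the other answers: the words are not adjacent.
substring_match <- grepl(short, long, fixed = TRUE)
```

Here `set_match` is TRUE and `substring_match` is FALSE. On the question's data the two approaches give the same rows, so which behavior is preferable depends on how the phrases were generated.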