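I have a dataset containing ids and their corresponding phrases. An id can have 2-word or 3-word phrases. Within an id, if there are both 2-word and 3-word phrases, I want to match each 2-word phrase against the 3-word phrases. If they match, keep the 2-word phrase and delete the 3-word phrase.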
Data:
id text
11 XYX not working
11 cant find anything
11 wont let go
11 wont let open
11 not working
11 let open
12 no music store
12 no sound store
12 not playing
12 not printing
12 no music
13 paper issue
13 charger issue
14 no issue found
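Example: in id 11, 'let open' matches 'wont let open', so delete 'wont let open' and keep 'let open'. 'not working' matches 'XYX not working', so keep 'not working' and delete 'XYX not working'. Phrases that do not match anything are kept as well. The matching between 2-word and 3-word phrases should always be done within a single id.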
Expected output:
id text
11 cant find anything
11 wont let go
11 not working
11 let open
12 no sound store
12 not playing
12 not printing
12 no music
13 paper issue
13 charger issue
14 no issue found
Thanks in advance!
Answer 0 (score: 2)
Here is a solution using the tidyverse family of packages:
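library(tidyverse) # attaches stringr, purrr and dplyr
# Despite its name, this returns TRUE for the phrases to KEEP: those that do
# not contain any other phrase from the same id (phrases are used as regex patterns).
is_long_phrase <- function(x) {
  map_lgl(x, ~ !any(str_detect(.x, setdiff(x, .x))))
}
# `data` is the data frame from the question, with columns id and text
data %>%
  group_by(id) %>%
  filter(is_long_phrase(text)) %>%
  ungroup()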
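For readers without the tidyverse, the same per-id rule can be sketched in base R. This is not the answer's code: `keep_phrase` is a hypothetical helper name, and unlike the pipeline above it only drops 3-word phrases that contain a 2-word phrase, using a plain substring test (`fixed = TRUE`), so a 2-word phrase that is a prefix of longer words would also count as a match.

```r
# Data from the question
df <- read.csv(text = 'id,text
11,XYX not working
11,cant find anything
11,wont let go
11,wont let open
11,not working
11,let open
12,no music store
12,no sound store
12,not playing
12,not printing
12,no music
13,paper issue
13,charger issue
14,no issue found', header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: for the phrases of ONE id, TRUE means "keep this phrase"
keep_phrase <- function(x) {
  n_words <- lengths(strsplit(x, "\\s+"))   # word count per phrase
  two <- x[n_words == 2]                    # the 2-word phrases of this id
  vapply(seq_along(x), function(i) {
    if (n_words[i] != 3) return(TRUE)       # only 3-word phrases can be dropped
    # drop the phrase if any 2-word phrase occurs in it as a literal substring
    !any(vapply(two, function(p) grepl(p, x[i], fixed = TRUE), logical(1)))
  }, logical(1))
}

# Apply the helper per id; unsplit() puts the flags back in the original row order
keep <- unsplit(lapply(split(df$text, df$id), keep_phrase), df$id)
result <- df[keep, ]
result
```

Because `unsplit()` realigns the flags with the original rows, this sketch does not require the data to be sorted by id.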
Answer 1 (score: 1)
Try this:
# the data
df <- read.csv(text='id,text
11,XYX not working
11,cant find anything
11,wont let go
11,wont let open
11,not working
11,let open
12,no music store
12,no sound store
12,not playing
12,not printing
12,no music
13,paper issue
13,charger issue
14,no issue found', header=TRUE, stringsAsFactors=FALSE)
# the code
df$words <- lengths(strsplit(df$text, split='\\s+')) # number of words in each text
df.idlst <- split(df, df$id) # one data frame per id
Vgrepl <- Vectorize(grepl, 'pattern', SIMPLIFY = TRUE) # grepl over several patterns
# flag a 3-word phrase when any 2-word phrase of the same id occurs in it
# (unlist() returns the flags in id order, so df must be sorted by id)
df$del <- unlist(lapply(df.idlst, function(d)
  sapply(seq_len(nrow(d)), function(i)
    d$words[i] == 3 && any(Vgrepl(d$text[d$words == 2], d$text[i])))))
df[!df$del, 1:2] # del == TRUE => the row has to be deleted
# the output
id text
2 11 cant find anything
3 11 wont let go
5 11 not working
6 11 let open
8 12 no sound store
9 12 not playing
10 12 not printing
11 12 no music
12 13 paper issue
13 13 charger issue
14 14 no issue found
Answer 2 (score: 0)
One idea is to create a custom function and apply it to the dataset:
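library(dplyr)
library(stringi)
# Returns the 3-word phrases in x that share both words with some 2-word phrase
fun1 <- function(x){
  ind <- character(0) # nothing to drop for single-phrase ids
  if(length(x) > 1) {
    # every (3-word phrase, 2-word phrase) pair within the id
    m1 <- expand.grid(x[stri_count_words(x) == 3], x[stri_count_words(x) == 2])
    ind <- unique(m1[apply(m1, 1, function(i)length(Reduce(`intersect`, stri_extract_all_words(i)))) == 2,1])
  }
  return(as.character(ind))
}
df %>%
  group_by(id) %>%
  filter(!text %in% fun1(text))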
#Source: local data frame [11 x 2]
#Groups: id [4]
# id text
# <int> <chr>
#1 11 not working
#2 11 let open
#3 11 cant find anything
#4 11 wont let go
#5 12 not playing
#6 12 not printing
#7 12 no music
#8 12 no sound store
#9 13 paper issue
#10 13 charger issue
#11 14 no issue found
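One thing worth checking with the word-intersection approach: it compares sets of words, so word order and adjacency are ignored. The base R sketch below contrasts it with a literal substring test. The phrases 'no store' and 'store no' are made up for illustration and do not appear in the question's data, and `words` is a hypothetical helper.

```r
three <- "no music store"
two <- c("no music", "no store", "store no")   # last two are illustrative only

# Hypothetical helper: split a phrase into its words
words <- function(s) strsplit(s, "\\s+")[[1]]

# Word-set test: does the 3-word phrase contain both words of the 2-word phrase?
shared <- vapply(two, function(p)
  length(intersect(words(p), words(three))) == 2, logical(1))

# Substring test: does the 2-word phrase occur verbatim in the 3-word phrase?
substring_hit <- vapply(two, function(p)
  grepl(p, three, fixed = TRUE), logical(1))

data.frame(two, shared, substring_hit, row.names = NULL)
```

On the question's data both tests give the same result; they only diverge when the two words are non-adjacent or reordered, so which one is appropriate depends on what counts as a match.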