Filter a data frame based on (exact) matching values between two columns

Asked: 2016-09-30 12:39:55

Tags: r dataframe match

I have a data frame with two columns. One column contains sentences and the other contains words. For example:

words   sentences
loose   Loose connection several times a day on my tablet.  
loud    People don't speak loud or clear enough to hear voicemails
vice    I strongly advice you to fix this issue
advice  I strongly advice you to fix this issue

Now I want to filter this data frame so that I keep only the rows where the word exactly matches a word in the sentence:

words   sentences
loose   Loose connection several times a day on my tablet.  
loud    People don't speak loud or clear enough to hear voicemails
advice  I strongly advice you to fix this issue   

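The word "vice" is not an exact match, so that row has to be removed. I have nearly 20k rows in the data frame. Can someone suggest which method to use for this task so that I don't lose too much performance?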

3 answers:

Answer 0 (score: 3)

Using:

library(stringi)
df[stri_detect_regex(tolower(df$sentences), paste0('\\b',df$words,'\\b')),]

you get:

   words                                                  sentences
1  loose         Loose connection several times a day on my tablet.
2   loud People don't speak loud or clear enough to hear voicemails
4 advice                    I strongly advice you to fix this issue

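Explanation:

  • tolower converts the uppercase letters in the sentences to lowercase.
  • paste0 builds a vector of regular expressions by wrapping each word in word boundaries ('\\b').
  • stri_detect_regex from the stringi package checks, row by row, whether the sentence matches its pattern, which yields a logical vector of TRUE and FALSE values.
  • The data frame is then subset with that logical vector.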
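The four steps above can be sketched one at a time. This sketch swaps in base R (mapply plus grepl with perl = TRUE) for stri_detect_regex, so it runs without extra packages; the intermediate names lower_sentences, patterns, keep and result are only illustrative.

```r
# Same sample data as in the question
df <- data.frame(
  words = c("loose", "loud", "vice", "advice"),
  sentences = c("Loose connection several times a day on my tablet.",
                "People don't speak loud or clear enough to hear voicemails",
                "I strongly advice you to fix this issue",
                "I strongly advice you to fix this issue"),
  stringsAsFactors = FALSE
)

# Step 1: lower-case the sentences
lower_sentences <- tolower(df$sentences)

# Step 2: one regex per row, with the word wrapped in word boundaries
patterns <- paste0("\\b", df$words, "\\b")

# Step 3: test each sentence against the pattern of its own row
# (base R stand-in for stri_detect_regex; this is an assumption of the sketch)
keep <- mapply(grepl, patterns, lower_sentences,
               MoreArgs = list(perl = TRUE), USE.NAMES = FALSE)

# Step 4: subset with the logical vector
result <- df[keep, ]
print(result)
```

Here keep is FALSE only for the "vice" row, because in "advice" the letters "vice" are preceded by another word character and so do not start at a word boundary.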

As an alternative, you can also use str_detect from the stringr package (which is in fact a wrapper around the stringi package):

library(stringr)
df[str_detect(tolower(df$sentences), paste0('\\b',df$words,'\\b')),]

Used data:

df <- structure(list(words = c("loose", "loud", "vice", "advice"), 
                     sentences = c("Loose connection several times a day on my tablet.", 
                                   "People don't speak loud or clear enough to hear voicemails", 
                                   "I strongly advice you to fix this issue", "I strongly advice you to fix this issue")), 
                .Names = c("words", "sentences"), class = "data.frame", row.names = c(NA, -4L))

Answer 1 (score: 2)

You could try something like the following:

df[apply(df, 1, function(x) tolower(x[1]) %in% tolower(unlist(strsplit(x[2], split='\\s+')))),]

df
   words                                                sentences
1  loose       Loose connection several times a day on my tablet.
2   loud People dont speak loud or clear enough to hear voicemail
4 advice          advice  I strongly advice you to fix this issue
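One thing to keep in mind with this approach: splitting on whitespace alone leaves punctuation attached to the tokens, so a word such as "tablet" would not be found in "... on my tablet.". A possible variation, sketched here with a hypothetical helper has_word that is not part of the answer above, splits on anything that is not a letter, digit or apostrophe:

```r
# Hypothetical helper: TRUE if `word` occurs as a whole token in `sentence`.
# Tokens are split on runs of characters that are not alphanumeric or an
# apostrophe, so trailing punctuation such as "." or "?" is dropped.
has_word <- function(word, sentence) {
  tokens <- unlist(strsplit(tolower(sentence), "[^[:alnum:]']+"))
  tolower(word) %in% tokens
}

sentence <- "Loose connection several times a day on my tablet."

# Whitespace-only split keeps the token "tablet." with the full stop attached
whitespace_hit <- "tablet" %in% tolower(unlist(strsplit(sentence, "\\s+")))

# Punctuation-aware split finds the bare token "tablet"
punct_hit <- has_word("tablet", sentence)

# A partial word is still rejected
vice_hit <- has_word("vice", "I strongly advice you to fix this issue")

print(c(whitespace_hit, punct_hit, vice_hit))
```

For a whole data frame, the helper could be applied row-wise, for example with mapply(has_word, df$words, df$sentences).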

Answer 2 (score: 1)

The simplest solution is to use the stringr package:

library(stringr)

df <- data.frame(words=c("went","zero", "vice"), sent=c("a man went to the park","one minus one is 0","any advice?"))

df$words <- paste0(" ",df$words," ")
df$sent <- paste0(" ",df$sent," ")


df$match <- str_detect(df$sent,df$words)

df.res <- df[df$match > 0,]
df.res$match<-NULL
df.res
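Padding with spaces only recognises words that are surrounded by spaces, so a word directly followed by punctuation is missed. The sketch below compares the padding idea with a word-boundary regex. It uses base R grepl instead of str_detect so that it runs without packages, and it adds a fourth row ("advice") to the answer's sample data purely for illustration; the names space_match and boundary_match are assumptions as well.

```r
# Answer's sample data plus one extra illustrative row ("advice")
df <- data.frame(
  words = c("went", "zero", "vice", "advice"),
  sent  = c("a man went to the park", "one minus one is 0",
            "any advice?", "any advice?"),
  stringsAsFactors = FALSE
)

# Space padding: literal (fixed) match of " word " inside " sentence "
space_match <- mapply(grepl,
                      paste0(" ", df$words, " "),
                      paste0(" ", df$sent, " "),
                      MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

# Word boundaries: "advice" is found even though "?" follows it
boundary_match <- mapply(grepl,
                         paste0("\\b", df$words, "\\b"),
                         df$sent,
                         MoreArgs = list(perl = TRUE), USE.NAMES = FALSE)

print(data.frame(words = df$words, space_match, boundary_match))
```

Both variants reject "vice", but only the word-boundary version keeps the "advice" row, since "advice?" is not followed by a space.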