I have a data frame with two columns. One column contains sentences and the other contains words. For example:
words sentences
loose Loose connection several times a day on my tablet.
loud People don't speak loud or clear enough to hear voicemails
vice I strongly advice you to fix this issue
advice I strongly advice you to fix this issue
Now I want to filter this data frame so that I keep only the rows where the word exactly matches a whole word in the sentence:
words sentences
loose Loose connection several times a day on my tablet.
loud People don't speak loud or clear enough to hear voicemails
advice I strongly advice you to fix this issue
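The word "vice" is not an exact match, so that row must be removed. I have nearly 20k rows in the data frame. Can someone suggest an approach for this task that does not cost too much performance?
Answer 0 (score: 3)
Use: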
library(stringi)
df[stri_detect_regex(tolower(df$sentences), paste0('\\b',df$words,'\\b')),]
This gives:
words sentences
1 loose Loose connection several times a day on my tablet.
2 loud People don't speak loud or clear enough to hear voicemails
4 advice I strongly advice you to fix this issue
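Explanation:
tolower converts the capital letters in the sentences to lower case.
paste0 wraps each word in word boundaries (\\b), which creates a vector of regular expressions, one per row.
stri_detect_regex checks each sentence against the pattern for its own row and returns a logical vector of TRUE and FALSE values.
That logical vector is then used to subset the rows of df.
As an alternative, you can also use str_detect from the stringr package (which is actually a wrapper around the stringi package):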
library(stringr)
df[str_detect(tolower(df$sentences), paste0('\\b',df$words,'\\b')),]
Data used:
df <- structure(list(words = c("loose", "loud", "vice", "advice"),
sentences = c("Loose connection several times a day on my tablet.",
"People don't speak loud or clear enough to hear voicemails",
"I strongly advice you to fix this issue", "I strongly advice you to fix this issue")),
.Names = c("words", "sentences"), class = "data.frame", row.names = c(NA, -4L))
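To make the steps in the explanation easier to follow, here is a minimal sketch that computes each intermediate vector separately. It assumes you want to avoid the stringi dependency, so it swaps in base R's grepl with mapply; this is not the code from the answer, and the names lowered, patterns, keep and result are only illustrative.

```r
df <- data.frame(
  words = c("loose", "loud", "vice", "advice"),
  sentences = c("Loose connection several times a day on my tablet.",
                "People don't speak loud or clear enough to hear voicemails",
                "I strongly advice you to fix this issue",
                "I strongly advice you to fix this issue"),
  stringsAsFactors = FALSE
)

# Step 1: lower-case the sentences so that "Loose" can match "loose"
lowered <- tolower(df$sentences)

# Step 2: build one regular expression per row, wrapped in word boundaries
patterns <- paste0("\\b", df$words, "\\b")

# Step 3: grepl is not vectorised over the pattern, so pair each pattern
# with its own sentence via mapply (assumption: base R instead of stringi)
keep <- unname(mapply(grepl, patterns, lowered, MoreArgs = list(perl = TRUE)))

# Step 4: subset with the logical vector
result <- df[keep, ]
result
```

Note that the words are used as regular expressions here, so a word containing characters such as "." or "(" would need to be escaped first.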
Answer 1 (score: 2)
You can try something like this:
df[apply(df, 1, function(x) tolower(x[1]) %in% tolower(unlist(strsplit(x[2], split='\\s+')))),]
df
words sentences
1 loose Loose connection several times a day on my tablet.
2 loud People dont speak loud or clear enough to hear voicemail
4 advice advice I strongly advice you to fix this issue
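One thing to watch with splitting on whitespace is that punctuation stays attached to the tokens, so "tablet." is not the same token as "tablet". A hedged sketch of one way around this, assuming it is acceptable to strip all punctuation before splitting (has_word is a hypothetical helper, not part of the answer):

```r
df <- data.frame(
  words = c("tablet", "vice"),
  sentences = c("Loose connection several times a day on my tablet.",
                "I strongly advice you to fix this issue"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if `word` equals one whitespace-separated token
# of `sentence`, ignoring case and punctuation
has_word <- function(word, sentence) {
  cleaned <- gsub("[[:punct:]]", "", tolower(sentence))
  tokens <- strsplit(cleaned, "\\s+")[[1]]
  tolower(word) %in% tokens
}

# Compare each word only with the sentence in the same row
keep <- unname(mapply(has_word, df$words, df$sentences))
df[keep, ]
```

Stripping punctuation also turns "don't" into "dont", so this only suits data where that is harmless.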
Answer 2 (score: 1)
The simplest solution is to use the stringr package:
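library(stringr)
df <- data.frame(words = c("went", "zero", "vice"),
                 sent = c("a man went to the park", "one minus one is 0", "any advice?"))
# Pad both columns with spaces so that only space-delimited whole words match
df$words <- paste0(" ", df$words, " ")
df$sent <- paste0(" ", df$sent, " ")
# Row-wise detection: each sentence is tested against the word in the same row
df$match <- str_detect(df$sent, df$words)
df.res <- df[df$match, ]
df.res$match <- NULL
df.res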
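Keep in mind that padding with spaces only finds words that are delimited by spaces on both sides, so a word followed directly by punctuation is missed. The sketch below compares the padding idea with a word-boundary pattern; it uses base R's grepl rather than stringr, and the example vectors are made up for illustration.

```r
sent <- c("a man went to the park", "any advice?")
word <- c("went", "advice")

# Space padding: literal match of " word " inside " sentence "
padded <- unname(mapply(
  function(w, s) grepl(paste0(" ", w, " "), paste0(" ", s, " "), fixed = TRUE),
  word, sent
))

# Word boundaries: \\b also matches between a letter and punctuation
bounded <- unname(mapply(
  function(w, s) grepl(paste0("\\b", w, "\\b"), s, perl = TRUE),
  word, sent
))

data.frame(word, sent, padded, bounded)
```

Here "advice" in "any advice?" is found by the word-boundary version but not by the padded one, because the word is followed by "?" rather than a space.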