Question

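I'm looking for a way to match one column against another, taking word boundaries into account. If there is no match, the whole row should be removed. Example: if there is no exact token match between NODE and SENTENCE (data frame df), remove the row. Note that banana != bananas. In other words: if (\b.+\b) in NODE can't be found in SENTENCE, remove the row.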

NODE     |     SENTENCE
-----------------------------------------------------------
banana         I am a banana and I like it
banana         We ate two bananas yesterday
banana         I ate a banana two days ago
coffee         Would you like a cup of coffee?
coffee         We went by that new coffeeshop the other day

Desired result

NODE     |     SENTENCE
-----------------------------------------------------------
banana         I am a banana and I like it
banana         I ate a banana two days ago
coffee         Would you like a cup of coffee?

I thought about using ifelse, but I'm not entirely sure how to apply it.

ifelse(df$NODE==df$SENTENCE,NA,???)

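Edit: I tried nico's answer, but it doesn't work for me. Using \\s instead of \\b does work, though. Doesn't \\b mean a word boundary? The drawback of \\s is that it won't detect the node when it sits at the start or end of the sentence (because there is no whitespace character before or after it).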

r <- c("Het label heeft ook verantwoordelijkheidsgevoel: aan de lancering van B-Camp wordt een Goodwill Project gekoppeld, een fonds dat zijn financiële bijdrage wil leveren ter bestrijding van de aids-plaag.",
    "B-Camp koos voor de opvang en verzorging van kinderen besmet met het aids-virus.",
    "Hij zei dat hij aids had.",
    "Aids in het land?")
s <- c("aids","aids","aids","aids")
d1 <- data.frame(node = s,sentence=r)

matches <- mapply(grep, paste0("(?i)\\s", d1$node, "\\s"), d1$sentence)
to.keep <- sapply(matches, length)>0
(d1 <- d1[to.keep,])

Output

node    sentence
---------------------------------
aids    Hij zei dat hij aids had.

Expected output

node    sentence
----------------
aids    Hij zei dat hij aids had.
aids    Aids in het land?

Answer 1

Here is a possible vectorized solution using the stringi package (though it may be overly complicated...)

library(stringi)
indx <- as.logical(rowSums(with(df, 
                                NODE == stri_split_regex(SENTENCE,
                                "[[:punct:] ]", simplify = TRUE))))
df[indx, ]
#    NODE                        SENTENCE
# 1 banana     I am a banana and I like it
# 3 banana     I ate a banana two days ago
# 4 coffee Would you like a cup of coffee?

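The idea here is to convert SENTENCE into a matrix of words, split on punctuation or spaces, and then use the == operator to check whether any of those words matches NODE exactly.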
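For readers without stringi, the same split-then-compare idea can be sketched in base R. This is an assumed equivalent, not the answerer's code: it uses strsplit to get one token vector per row instead of a padded matrix, and the names tokens, keep and result are made up for illustration.

```r
# Sample data from the question (stringsAsFactors = FALSE keeps plain strings).
df <- data.frame(
  NODE = c("banana", "banana", "banana", "coffee", "coffee"),
  SENTENCE = c("I am a banana and I like it",
               "We ate two bananas yesterday",
               "I ate a banana two days ago",
               "Would you like a cup of coffee?",
               "We went by that new coffeeshop the other day"),
  stringsAsFactors = FALSE
)

# Split every sentence into tokens on punctuation or spaces (one vector per row).
tokens <- strsplit(df$SENTENCE, "[[:punct:] ]")

# Keep a row only if its NODE is identical to one of that row's tokens.
keep <- mapply(function(node, toks) node %in% toks, df$NODE, tokens,
               USE.NAMES = FALSE)

result <- df[keep, ]
print(result)
```

Because whole tokens are compared, "bananas" and "coffeeshop" do not count as matches for "banana" and "coffee".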

Edit: for the new data set

indx <- as.logical(rowSums(with(d1, 
                  node == tolower(stri_split_regex(sentence, "[ :?.,]",
                  simplify = TRUE)))))

d1[indx, ]
#  node                  sentence
# 3 aids Hij zei dat hij aids had.
# 4 aids         Aids in het land?

Edit #2 (an attempt to be less "resource intensive")

myfunc <- function(x, y) any(x == y)
indx <- with(d1, mapply(myfunc, node, stri_split_regex(tolower(sentence), "[ :?.,]")))
d1[indx, ]
#  node                  sentence
# 3 aids Hij zei dat hij aids had.
# 4 aids         Aids in het land?

Answer 2

This should work:

# Use grep to match \bNODE\b in SENTENCE row by row
matches <- mapply(grep, paste0("\\b", df$NODE, "\\b"), df$SENTENCE)
# Find rows with at least one match
to.keep <- sapply(matches, length)>=1
# Keep those
df[to.keep,]

Note that grep returns integer(0) when it finds no match, so I use length to test for matches. The sapply call produces a vector with the number of matches for each row.
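To try the snippet above end to end, here is a minimal sketch that builds the question's sample data first. The data frame construction, SIMPLIFY = FALSE and lengths() are additions for illustration. SIMPLIFY = FALSE is there so the result stays a list even if every row happens to match.

```r
# Sample data from the question.
df <- data.frame(
  NODE = c("banana", "banana", "banana", "coffee", "coffee"),
  SENTENCE = c("I am a banana and I like it",
               "We ate two bananas yesterday",
               "I ate a banana two days ago",
               "Would you like a cup of coffee?",
               "We went by that new coffeeshop the other day"),
  stringsAsFactors = FALSE
)

# One grep call per row: the pattern \bNODE\b against that row's SENTENCE.
matches <- mapply(grep, paste0("\\b", df$NODE, "\\b"), df$SENTENCE,
                  SIMPLIFY = FALSE)

# grep gives integer(0) when nothing matches, so length 0 means "drop the row".
to.keep <- lengths(matches) > 0

kept <- df[to.keep, ]
print(kept)
```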

Edit: after the question was edited

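You can use ignore.case=T to make the match case-insensitive. I updated the regex to take the start and end of the sentence into account. There must be a simpler way...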

matches <- mapply(grep, paste0("\\s", d1$node, "\\s|^", d1$node, 
           "|", d1$node, "$"), d1$sentence, ignore.case=TRUE)

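One possible simplification, offered as an assumption rather than as the answerer's method, is to use Perl-style lookarounds. In the second data set \\b fails because a hyphen is not a word character, so \\b sees a boundary between "aids" and "-" and "aids-virus" counts as a match. The sketch below uses three of the question's sentences and requires that the node is neither preceded nor followed by a word character or a hyphen. The names with_b, with_look and pattern are made up for illustration.

```r
# Three of the sentences from the question's second data set.
d1 <- data.frame(
  node = c("aids", "aids", "aids"),
  sentence = c("B-Camp koos voor de opvang en verzorging van kinderen besmet met het aids-virus.",
               "Hij zei dat hij aids had.",
               "Aids in het land?"),
  stringsAsFactors = FALSE
)

# With \b the hyphenated compound "aids-virus" also matches.
with_b <- mapply(grepl, paste0("\\b", d1$node, "\\b"), d1$sentence,
                 MoreArgs = list(ignore.case = TRUE), USE.NAMES = FALSE)

# Assumed alternative: no word character or hyphen directly before or after.
pattern <- paste0("(?<![\\w-])", d1$node, "(?![\\w-])")
with_look <- mapply(grepl, pattern, d1$sentence,
                    MoreArgs = list(ignore.case = TRUE, perl = TRUE),
                    USE.NAMES = FALSE)

print(with_b)
print(with_look)
print(d1[with_look, ])
```

The lookarounds also cover a node at the very start or end of a sentence, since they only assert what must not be adjacent. If a node could contain regex metacharacters, it would need escaping before being pasted into the pattern.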
Remove rows in R if the value in column A can't be found in column B
