Remove rows in R if the value in column A is not found in column B

Time: 2015-04-06 09:28:14

Tags: regex r

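I'm looking for a way to match one column against another, taking word boundaries into account. If there is no match, the whole row should be removed. Example (data frame df): if there is no exact token match between NODE and SENTENCE (note that banana != bananas), remove the row. In other words: if (\b.+\b) in NODE can't be found in SENTENCE, remove the row.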

NODE     |     SENTENCE
-----------------------------------------------------------
banana         I am a banana and I like it
banana         We ate two bananas yesterday
banana         I ate a banana two days ago
coffee         Would you like a cup of coffee?
coffee         We went by that new coffeeshop the other day

Result

NODE     |     SENTENCE
-----------------------------------------------------------
banana         I am a banana and I like it
banana         I ate a banana two days ago
coffee         Would you like a cup of coffee?

I thought about using ifelse, but I'm not entirely sure how to apply it.

ifelse(df$NODE==df$SENTENCE,NA,???)

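Edit: I tried nico's answer, but it doesn't work for me. Using \\s instead of \\b does work, though. Doesn't \\b mean a word boundary? The drawback of \\s is that it won't detect the node when it sits at the very start or end of the sentence, because there is no whitespace character before or after it there.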

r <- c("Het label heeft ook verantwoordelijkheidsgevoel: aan de lancering van B-Camp wordt een Goodwill Project gekoppeld, een fonds dat zijn financiële bijdrage wil leveren ter bestrijding van de aids-plaag.",
    "B-Camp koos voor de opvang en verzorging van kinderen besmet met het aids-virus.",
    "Hij zei dat hij aids had.",
    "Aids in het land?")
s <- c("aids","aids","aids","aids")
d1 <- data.frame(node = s,sentence=r)

matches <- mapply(grep, paste0("(?i)\\s", d1$node, "\\s"), d1$sentence)
to.keep <- sapply(matches, length)>0
(d1 <- d1[to.keep,])

Output

node    sentence
---------------------------------
aids    Hij zei dat hij aids had.       

Expected output

node    sentence
----------------
aids    Hij zei dat hij aids had.
aids    Aids in het land?
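
A minimal sketch of why the two patterns disagree on this data. It uses shortened fragments of the four sentences, and `perl = TRUE` is set only to pin down the regex engine; both are assumptions made here for illustration.

```r
# Shortened fragments of the four sentences from the question
sentences <- c("ter bestrijding van de aids-plaag.",
               "besmet met het aids-virus.",
               "Hij zei dat hij aids had.",
               "Aids in het land?")

# \\b: a hyphen is a non-word character, so "aids-plaag" still counts as a match
with_b <- grepl("\\baids\\b", sentences, ignore.case = TRUE, perl = TRUE)

# \\s: needs whitespace on both sides, so a sentence-initial "Aids" is missed
with_s <- grepl("\\saids\\s", sentences, ignore.case = TRUE, perl = TRUE)

print(with_b)  # TRUE TRUE TRUE TRUE
print(with_s)  # FALSE FALSE TRUE FALSE
```

So \\b is a word boundary, but it also fires next to a hyphen, which is why the hyphenated compounds are kept.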

2 answers:

Answer 0 (score: 3)

Here's a possible vectorized solution using the stringi package (though it may be overly complicated...)

library(stringi)
indx <- as.logical(rowSums(with(df, 
                                NODE == stri_split_regex(SENTENCE,
                                "[[:punct:] ]", simplify = TRUE))))
df[indx, ]
#    NODE                        SENTENCE
# 1 banana     I am a banana and I like it
# 3 banana     I ate a banana two days ago
# 4 coffee Would you like a cup of coffee?

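The idea here is to convert SENTENCE into a matrix of words, split on punctuation or spaces, and then use the == operator to check whether any of those words is an exact match for NODE.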
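
The same split-then-compare idea can be sketched in base R. Note that this swaps `strsplit` in for `stri_split_regex`, works on a list of token vectors rather than a matrix, and uses a three-row toy copy of the question's data, so it is an illustration rather than the code above.

```r
# Toy data mirroring the question (character columns, not factors)
df <- data.frame(
  NODE = c("banana", "banana", "coffee"),
  SENTENCE = c("I am a banana and I like it",
               "We ate two bananas yesterday",
               "Would you like a cup of coffee?"),
  stringsAsFactors = FALSE
)

# Split each sentence into tokens on punctuation or spaces
tokens <- strsplit(df$SENTENCE, "[[:punct:] ]")

# "bananas" is its own token, so it never equals "banana"
print(tokens[[2]])

# Keep a row only if one of its tokens equals NODE exactly
keep <- mapply(function(node, words) any(words == node), df$NODE, tokens)

df[keep, ]
```

Because the comparison is done token by token, no word-boundary regex is needed at all.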


Edit (per the new data set)

indx <- as.logical(rowSums(with(d1, 
                  node == tolower(stri_split_regex(sentence, "[ :?.,]",
                  simplify = TRUE)))))

d1[indx, ]
#  node                  sentence
# 3 aids Hij zei dat hij aids had.
# 4 aids         Aids in het land?

Edit #2 (an attempt to be less "resource intensive")

myfunc <- function(x, y) any(x == y)
indx <- with(d1, mapply(myfunc, node, stri_split_regex(tolower(sentence), "[ :?.,]")))
d1[indx, ]
#  node                  sentence
# 3 aids Hij zei dat hij aids had.
# 4 aids         Aids in het land?

Answer 1 (score: 2)

This should work:

# Use grep to match \bNODE\b in SENTENCE row by row
matches <- mapply(grep, paste0("\\b", df$NODE, "\\b"), df$SENTENCE)
# Find rows with at least one match
to.keep <- sapply(matches, length)>=1
# Keep those
df[to.keep,]

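Note that grep returns integer(0) if no match is found, so I use length to test for matches. The sapply call produces a vector with the number of matches for each word, which is then compared against 1.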
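
A quick sketch of that behaviour on two sentences from the question (the variable names `hit` and `miss` are made up for this example):

```r
# grep returns the indices of matching elements, or integer(0) when nothing matches
hit  <- grep("\\bbanana\\b", "I ate a banana two days ago")
miss <- grep("\\bbanana\\b", "We ate two bananas yesterday")

print(hit)           # 1
print(miss)          # integer(0)

# Testing the length turns "no match" into FALSE
length(miss) >= 1    # FALSE
```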

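Edit: after the question was edited

You can use ignore.case=T to make the match case-insensitive. I also updated the regex to account for sentence boundaries. There must be a simpler way...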

matches <- mapply(grep, paste0("\\s", d1$node, "\\s|^", d1$node, 
           "|", d1$node, "$"), d1$sentence, ignore.case=TRUE)
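
One possibly simpler variant, offered as a sketch rather than a tested replacement: with `perl = TRUE`, negative lookarounds can require that the node is not touching a word character or a hyphen on either side. Treating the hyphen as part of a word is an assumption drawn from the expected output in the question, and the data below uses shortened fragments of the original sentences.

```r
d1 <- data.frame(
  node = c("aids", "aids", "aids", "aids"),
  sentence = c("ter bestrijding van de aids-plaag.",
               "besmet met het aids-virus.",
               "Hij zei dat hij aids had.",
               "Aids in het land?"),
  stringsAsFactors = FALSE
)

# Node must not be preceded or followed by a word character or a hyphen
patterns <- paste0("(?<![\\w-])", d1$node, "(?![\\w-])")

# Match each pattern against its own sentence, case-insensitively, using PCRE
to.keep <- mapply(grepl, patterns, d1$sentence,
                  MoreArgs = list(ignore.case = TRUE, perl = TRUE))

d1[to.keep, ]
```

This keeps rows 3 and 4 only. If a node could contain regex metacharacters, it would need to be escaped before being pasted into the pattern.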