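I am looking for a way to match one column against another, taking word boundaries into account. If there is no match, the whole row should be removed. Example: if there is no exact token match between NODE and SENTENCE in the data frame df (note that banana != bananas), remove the row. In other words: if (\b.+\b) in NODE can't be found in SENTENCE, remove the row.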
NODE | SENTENCE
-----------------------------------------------------------
banana I am a banana and I like it
banana We ate two bananas yesterday
banana I ate a banana two days ago
coffee Would you like a cup of coffee?
coffee We went by that new coffeeshop the other day
Result
NODE | SENTENCE
-----------------------------------------------------------
banana I am a banana and I like it
banana I ate a banana two days ago
coffee Would you like a cup of coffee?
I thought about using ifelse, but I am not entirely sure how to apply it.
ifelse(df$NODE==df$SENTENCE,NA,???)
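Using \\s instead of \\b works as intended. Doesn't - count as a word boundary? The drawback of \\s is that it does not detect the node when it sits at the beginning or end of a sentence (because it is then not surrounded by whitespace characters).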
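The behaviour described here can be reproduced with a minimal sketch. The strings are shortened versions of the example sentences. The last pattern is a hypothetical alternative that is not part of the original question: it assumes that a hyphen should count as part of a token and uses PCRE lookarounds (perl = TRUE) to express that.

```r
# A hyphen is a non-word character, so \\b finds a boundary between
# "aids" and "-" and the pattern matches inside "aids-plaag"
b_hit <- grepl("\\baids\\b", "de aids-plaag.")

# \\s needs whitespace on both sides, so a node at the very start of
# the sentence is missed
s_hit <- grepl("\\saids\\s", "Aids in het land?", ignore.case = TRUE)

# Hypothetical alternative (an assumption): the node may not be preceded
# or followed by a word character or a hyphen
pattern <- "(?<![\\w-])aids(?![\\w-])"
x <- c("de aids-plaag.", "Hij zei dat hij aids had.", "Aids in het land?")
keep <- grepl(pattern, x, ignore.case = TRUE, perl = TRUE)
```

Here b_hit is TRUE, s_hit is FALSE, and keep flags only the last two strings. The node is pasted into the pattern unescaped, so this sketch assumes it contains no regex metacharacters.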
r <- c("Het label heeft ook verantwoordelijkheidsgevoel: aan de lancering van B-Camp wordt een Goodwill Project gekoppeld, een fonds dat zijn financiële bijdrage wil leveren ter bestrijding van de aids-plaag.",
"B-Camp koos voor de opvang en verzorging van kinderen besmet met het aids-virus.",
"Hij zei dat hij aids had.",
"Aids in het land?")
s <- c("aids","aids","aids","aids")
d1 <- data.frame(node = s,sentence=r)
matches <- mapply(grep, paste0("(?i)\\s", d1$node, "\\s"), d1$sentence)
to.keep <- sapply(matches, length)>0
(d1 <- d1[to.keep,])
Output
node sentence
---------------------------------
aids Hij zei dat hij aids had.
Expected output
node sentence
----------------
aids Hij zei dat hij aids had.
aids Aids in het land?
Answer 0 (score: 3)
Here is a possible vectorized solution using the stringi package (though it may be overly complicated...)
library(stringi)
indx <- as.logical(rowSums(with(df,
NODE == stri_split_regex(SENTENCE,
"[[:punct:] ]", simplify = TRUE))))
df[indx, ]
# NODE SENTENCE
# 1 banana I am a banana and I like it
# 3 banana I ate a banana two days ago
# 4 coffee Would you like a cup of coffee?
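The idea here is to convert SENTENCE into a matrix of words, split on punctuation or spaces, and then use the == operator to check whether any of those words is an exact match for NODE.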
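The same split-and-compare idea can be sketched in base R, without stringi. This is only an illustration under stated assumptions: the data frame is rebuilt from the question's example, and has_token is a hypothetical helper name, not something taken from the answer.

```r
# Toy data copied from the question
df <- data.frame(
  NODE = c("banana", "banana", "banana", "coffee", "coffee"),
  SENTENCE = c("I am a banana and I like it",
               "We ate two bananas yesterday",
               "I ate a banana two days ago",
               "Would you like a cup of coffee?",
               "We went by that new coffeeshop the other day"),
  stringsAsFactors = FALSE
)

# Split every sentence on punctuation or spaces; strsplit() returns a
# list with one character vector of tokens per sentence
tokens <- strsplit(df$SENTENCE, "[[:punct:] ]")

# Hypothetical helper: TRUE if the node equals at least one token exactly
has_token <- function(node, toks) any(node == toks)

# Compare each NODE with the tokens of its own SENTENCE
keep <- unname(mapply(has_token, df$NODE, tokens))

df[keep, ]
```

Rows 1, 3 and 4 are kept, because "bananas" and "coffeeshop" are never equal to the node as whole tokens.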
Edit
indx <- as.logical(rowSums(with(d1,
node == tolower(stri_split_regex(sentence, "[ :?.,]",
simplify = TRUE)))))
d1[indx, ]
# node sentence
# 3 aids Hij zei dat hij aids had.
# 4 aids Aids in het land?
Edit #2 (an attempt to be less "resource intensive")
myfunc <- function(x, y) any(x == y)
indx <- with(d1, mapply(myfunc, node, stri_split_regex(tolower(sentence), "[ :?.,]")))
d1[indx, ]
# node sentence
# 3 aids Hij zei dat hij aids had.
# 4 aids Aids in het land?
Answer 1 (score: 2)
This should work:
# Use grep to match \bNODE\b in SENTENCE row by row
matches <- mapply(grep, paste0("\\b", df$NODE, "\\b"), df$SENTENCE)
# Find rows with at least one match
to.keep <- sapply(matches, length)>=1
# Keep those
df[to.keep,]
Note that grep returns integer(0) when no match is found, so I use length to test for matches. The sapply call produces a vector holding the number of matches for each row.
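To make the length test concrete, here is a small sketch with two sentences from the question. It also shows grepl as a possible shortcut, which is my own substitution rather than part of the answer: grepl returns TRUE or FALSE directly, so no length step is needed.

```r
# No match: grep() returns a zero-length integer vector
no_hit <- grep("\\bbanana\\b", "We ate two bananas yesterday")
length(no_hit)

# One match: grep() returns the index of the matching element
hit <- grep("\\bbanana\\b", "I ate a banana two days ago")
length(hit)

# Variant with grepl(), applied pairwise to patterns and sentences
node <- c("banana", "banana")
sentence <- c("We ate two bananas yesterday", "I ate a banana two days ago")
keep <- unname(mapply(grepl, paste0("\\b", node, "\\b"), sentence))
```

With these inputs, no_hit has length 0, hit has length 1, and keep is FALSE for the first sentence and TRUE for the second.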
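Edit: after the question was edited
You can use ignore.case=TRUE to make the match case-insensitive.
I updated the regex to account for sentence boundaries. There must be a simpler way...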
matches <- mapply(grep, paste0("\\s", d1$node, "\\s|^", d1$node,
"|", d1$node, "$"), d1$sentence, ignore.case=TRUE)