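Is there an R package for identifying the positions (row indices) of the 1st, 2nd, 3rd, 4th, ... matches between text string columns of two different data frames?
For example:
I have the following data frames: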
dataframe: simpletext
row text
1 does he go to that bar or for shopping?
2 where was that bar that I wanted?
3 I would like to go to the opera instead for shopping
dataframe: keywords
row word
1 shopping
2 opera
3 bar
What I want is to find that the first match in simpletext$text[1] is keywords$word[3],
the second match in simpletext$text[1] is keywords$word[1], and so on for each row of simpletext.
Answer 0 (score: 0)
You could start with something like this:
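library(tidyverse)

# For one keyword, locate its first occurrence in every text.
# start/end are NA where the keyword does not occur.
find_locations <- function(word, text) {
  bind_cols(
    tibble(
      word = word,
      text = text
    ),
    as_tibble(str_locate(text, word))
  )
}

# One row per (keyword, text) pair
map_df(keywords$word, find_locations, text = simpletext$text)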
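The output above has one row per keyword/text pair but does not yet say which keyword comes first in each text. A minimal sketch of one way to get that ranking follows; it assumes the `dplyr` and `stringr` packages, that the `row` column equals the row position, and that keywords should be matched literally. The names `pairs`, `ranked`, `text_row`, `word_row`, and `match_rank` are made up for illustration.

```r
library(dplyr)
library(stringr)

simpletext <- data.frame(
  row = 1:3,
  text = c("does he go to that bar or for shopping?",
           "where was that bar that I wanted?",
           "I would like to go to the opera instead for shopping"),
  stringsAsFactors = FALSE
)
keywords <- data.frame(
  row = 1:3,
  word = c("shopping", "opera", "bar"),
  stringsAsFactors = FALSE
)

# One row per (text, keyword) combination (assumes row == row position).
pairs <- expand.grid(text_row = simpletext$row, word_row = keywords$row)

ranked <- pairs %>%
  mutate(
    word  = keywords$word[word_row],
    # start position of the first literal occurrence, NA if absent
    start = str_locate(simpletext$text[text_row], fixed(word))[, "start"]
  ) %>%
  filter(!is.na(start)) %>%          # keep only keywords that occur
  arrange(text_row, start) %>%       # order by position within each text
  group_by(text_row) %>%
  mutate(match_rank = row_number()) %>%  # 1 = first match, 2 = second, ...
  ungroup()
```

Each row of `ranked` then reads as "in text `text_row`, the match with rank `match_rank` is keyword `word_row`". Only the first occurrence of each keyword is considered, so a keyword repeated within one text is ranked once.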
Answer 1 (score: 0)
You can use the regexpr function (part of the grep family):
keywords <- c("shopping", "opera", "bar")
simpletext <- c("does he go to that bar or for shopping?",
                "where was that bar that I wanted?",
                "I would like to go to the opera instead for shopping")

text_match <- function(text, keywords)
{
  # check all keywords for matching
  matches <- vapply(keywords, function(x) regexpr(x, text)[1], FUN.VALUE = 1)
  # sort matched keywords in order of appearance
  sorted_matches <- names(sort(matches[matches > 0]))
  # return indices of sorted matches
  indices <- vapply(sorted_matches, function(x) which(keywords == x), FUN.VALUE = 1)
  return(indices)
}
Here regexpr(x, text)[1] returns the position of the first match of x in text, or -1 if there is no match.
The results are as follows:
text_match(simpletext[1], keywords)
#      bar shopping
#        3        1
text_match(simpletext[2], keywords)
# bar
#   3
text_match(simpletext[3], keywords)
#    opera shopping
#        2        1
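To get the ordering for every row at once rather than one call per text, the same idea can be wrapped in lapply. The following is a base R sketch under a few assumptions: the texts and keywords are plain character vectors, keywords should be matched literally (hence `fixed = TRUE`, which the answer above does not use), and `match_order` and `all_matches` are hypothetical names.

```r
keywords <- c("shopping", "opera", "bar")
simpletext <- c("does he go to that bar or for shopping?",
                "where was that bar that I wanted?",
                "I would like to go to the opera instead for shopping")

# Hypothetical helper: indices of the keywords found in `text`,
# ordered by where each keyword first appears.
match_order <- function(text, keywords) {
  # first match position of each keyword, -1 if it does not occur
  pos <- vapply(keywords,
                function(k) regexpr(k, text, fixed = TRUE)[1],
                integer(1))
  hit <- which(pos > 0)            # indices of keywords that occur
  unname(hit[order(pos[hit])])     # reorder by position in the text
}

# One integer vector of keyword indices per text row
all_matches <- lapply(simpletext, match_order, keywords = keywords)
```

Here `all_matches[[i]][1]` is the keyword index of the first match in text i, `all_matches[[i]][2]` the second, and so on. Note that this matches substrings, so "bar" would also be found inside a word such as "barn"; matching whole words only would need a regular expression with word boundaries instead of `fixed = TRUE`.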