Question

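I am trying to split strings into their component words, check whether each word exists in a separate vocabulary, and then keep only the strings whose words are all in that vocabulary. The vocabulary is a vector of words, created separately from the strings I want to compare. The end goal is a data frame containing only the strings whose words all appear in the vocabulary.

I have written code to parse the data into words, but I cannot figure out how to do the comparison. If you think parsing the data is not the best approach, please let me know.

Here is an example. Suppose I have three strings: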

"The elephant in the room is blue",
"The dog cannot swim",
"The cat is blue"

My vocabulary contains the following words:

cat,    **the**,    **elephant**,    hippo,
**in**,    run,    **is**,    bike,
walk,    **room, is, blue, cannot**

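In this case I would select only the first and third strings, because every word in them matches a word in my vocabulary. I would not select the second string, because the words "dog" and "swim" are not in the vocabulary.

Thanks!

As requested, here is the code I have written so far to clean the strings and parse them into unique words: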

animals <- c("The elephant in the room is blue", "The dog cannot swim", "The cat is blue")

animals2 <- toupper(animals)
animals2 <- gsub("[[:punct:]]", " ", animals2)
animals2 <- gsub("(^ +)|( +$)|(  +)", " ", animals2)

## Parse the characters and select unique words only
animals2 <- unlist(strsplit(animals2," "))
animals2 <- unique(animals2)

Answer 1

Here is how I would do it:

Read the data
Clean the vocabulary to remove extra spaces and *
Use setdiff

My code:

## read your data
tt <- c("The elephant in the room is blue",
"The dog cannot swim",
"The cat is blue")
vocab <- scan(textConnection('cat,    **the**,    **elephant**,    hippo,
**in**,    run,    **is**,    bike,
walk,    **room, is, blue, cannot**'),sep=',',what='char')
## polish vocab
vocab <- gsub('\\s+|[*]+','',vocab)
vocab <- vocab[nchar(vocab) >0]
## check each string: TRUE when every one of its words is in vocab
sapply(tt, function(x) {
    x.words <- tolower(unlist(strsplit(x, ' '))) ## lower-case so that "The" matches "the"
    length(setdiff(x.words, vocab)) == 0
})
## The elephant in the room is blue              The dog cannot swim                  The cat is blue 
##                             TRUE                            FALSE                             TRUE
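
The question asks for a data frame holding only the fully matched strings, so the logical result still has to be used as a filter. A minimal sketch of that last step follows. It assumes the vocabulary has already been cleaned into a plain lower-case character vector (hard-coded here for self-containment), and the names all_in_vocab, keep and result are made up for illustration. It also splits on any run of non-letters rather than on a single space, which is a swapped-in choice so that punctuation does not end up attached to a word.

```r
## Assumed inputs, copied from the example in the question
tt <- c("The elephant in the room is blue",
        "The dog cannot swim",
        "The cat is blue")
vocab <- c("cat", "the", "elephant", "hippo", "in", "run", "is",
           "bike", "walk", "room", "blue", "cannot")

## Hypothetical helper: TRUE when every lower-cased word of x is in vocab.
## Note that a string with no words at all would also return TRUE.
all_in_vocab <- function(x, vocab) {
    words <- tolower(unlist(strsplit(x, "[^[:alpha:]]+")))
    words <- words[nzchar(words)]   # drop empty pieces left by the split
    all(words %in% vocab)
}

## One logical value per string
keep <- vapply(tt, all_in_vocab, logical(1), vocab = vocab, USE.NAMES = FALSE)

## Data frame containing only the strings that passed the check
result <- data.frame(text = tt[keep], stringsAsFactors = FALSE)
print(result)
```

Using all(words %in% vocab) is equivalent to checking that setdiff(words, vocab) is empty. vapply is used instead of sapply only so that the return type is guaranteed to be logical.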
