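I have a list of sentences and a list of words, and I want to update each sentence so that it keeps only the words that appear in the word list.
For example, I have the following words:
"USA", "UK", "Germany", "Australia", "Italy", "in", "to"
and the following sentences:
"I lived in Germany 2 years", "I moved from Italy to USA", "people in USA, UK and Australia speak English"
I want to remove every word in the sentences that does not appear in the word list, so the expected output is the following sentences: "in Germany", "Italy to USA", "in USA UK Australia"
How can I do this using an apply function?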
mywords=data.frame(words=c("USA","UK","Germany","Australia","Italy","in","to"),
stringsAsFactors = F)
mysentences=data.frame(sentences=c("I lived in Germany 2 years",
"I moved from Italy to USA",
"people in USA, UK and Australia speak English"),
stringsAsFactors = F)
Answer 0 (score: 2)
If you convert this text to a tidy data format, you can use a join to find the matching words. You can then use purrr::map_chr() to return the desired strings.
library(tidyverse)
library(tidytext)
mywords <- data_frame(word = c("USA","UK","Germany","Australia","Italy","in","to"))
mysentences <- data_frame(sentences = c("I lived in Germany 2 years",
"I moved from Italy to USA",
"people in USA, UK and Australia speak English"))
mysentences %>%
mutate(id = row_number()) %>%
unnest_tokens(word, sentences, to_lower = FALSE) %>%
inner_join(mywords) %>%
nest(-id) %>%
mutate(sentences = map(data, unlist),
sentences = map_chr(sentences, paste, collapse = " ")) %>%
select(-data)
#> Joining, by = "word"
#> # A tibble: 3 × 2
#> id sentences
#> <int> <chr>
#> 1 1 in Germany
#> 2 2 Italy to USA
#> 3 3 in USA UK Australia
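The same tokenize, match, and re-collapse idea can be sketched in base R, which may make the "one token per row" shape easier to see. This is only an illustration under stated assumptions: the names `words`, `sentences`, `tidy`, and `matched` are made up here, `%in%` stands in for the inner join, and punctuation is stripped with a simple `gsub` rather than by `unnest_tokens`.

```r
words <- c("USA", "UK", "Germany", "Australia", "Italy", "in", "to")
sentences <- c("I lived in Germany 2 years",
               "I moved from Italy to USA",
               "people in USA, UK and Australia speak English")

# Strip punctuation, then split each sentence into tokens on whitespace.
tokens <- strsplit(gsub("[^[:alnum:] ]", "", sentences), "\\s+")

# One row per token, tagged with the id of the sentence it came from.
tidy <- data.frame(id = rep(seq_along(tokens), lengths(tokens)),
                   word = unlist(tokens),
                   stringsAsFactors = FALSE)

# Keep only rows whose word is in the list (the role inner_join plays above).
matched <- tidy[tidy$word %in% words, ]

# Paste the surviving words back together, one string per sentence id.
result <- tapply(matched$word, matched$id, paste, collapse = " ")
print(result)
```

As with the inner join, a sentence that contains none of the listed words simply drops out of the result.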
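Answer 1 (score: 2)
Here are two approaches. The first collapses the word list into a regular expression and then uses str_detect to match the sentences against that regex, keeping the sentences that contain at least one listed word: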
library(tidyverse)
library(glue)
mywords=data_frame(words=c("USA","UK","Germany","Australia","Italy","in","to"))
mysentences=data_frame(sentences=c("This is a sentence with no words of word list",
"I lived in Germany 2 years",
"I moved from Italy to USA",
"people in USA, UK and Australia speak English"))
mysentences %>%
filter(sentences %>%
str_detect(mywords$words %>% glue_collapse(sep = "|") %>% regex(ignore_case = TRUE))) # glue's collapse() was renamed glue_collapse()
#> # A tibble: 3 × 1
#> sentences
#> <chr>
#> 1 I lived in Germany 2 years
#> 2 I moved from Italy to USA
#> 3 people in USA, UK and Australia speak English
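For comparison, the filtering step can be sketched without glue or stringr. This is a rough base R equivalent, not the answer's own code: `grepl` stands in for `str_detect`, and the variable names `words`, `sentences`, `pattern`, and `kept` are assumptions made for the example.

```r
words <- c("USA", "UK", "Germany", "Australia", "Italy", "in", "to")
sentences <- c("This is a sentence with no words of word list",
               "I lived in Germany 2 years",
               "I moved from Italy to USA",
               "people in USA, UK and Australia speak English")

# Collapse the word list into one alternation pattern: "USA|UK|Germany|..."
pattern <- paste(words, collapse = "|")

# TRUE for every sentence that contains at least one listed word.
has_match <- grepl(pattern, sentences, ignore.case = TRUE)

# Keep whole sentences; nothing is removed from inside them.
kept <- sentences[has_match]
print(kept)
```

Note that this keeps or drops entire sentences. It does not strip the unlisted words out of each sentence.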
The second approach uses regex_semi_join from fuzzyjoin (which uses str_detect behind the scenes and does the above for you).
Answer 2 (score: 1)
You can also use stringr. I apologize for posting twice; that was a mistake.
vect <- c("USA","UK","Germany","Australia","Italy","in","to")
sentence <- c("I lived in Germany 2 years", "I moved from Italy to USA", "people in USA, UK and Australia speak English")
library(stringr)
li <- str_extract_all(sentence,paste0(vect,collapse="|"))
d <- list()
for (i in seq_along(li)) {
d[i] <- paste(li[[i]],collapse=" ")
}
unlist(d)
Output:
> unlist(d)
[1] "in Germany"
[2] "Italy to USA"
[3] "in USA UK Australia"
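One caveat with a plain alternation pattern is that it also matches inside longer words, so "in" would be pulled out of a word such as "Berlin". A minimal sketch, assuming whole-word matching is what is wanted, wraps the pattern in `\\b` anchors. The first example sentence and the helper name `keep_listed_words` are hypothetical, and base R's `regmatches`/`gregexpr` are used in place of `str_extract_all`.

```r
vect <- c("USA", "UK", "Germany", "Australia", "Italy", "in", "to")
sentence <- c("I lived in Berlin Germany 2 years",  # hypothetical sentence
              "I moved from Italy to USA")

# Hypothetical helper: extract only whole-word matches, then re-join them.
keep_listed_words <- function(x, words) {
  # \\b anchors stop "in" from matching inside "Berlin".
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  vapply(hits, paste, character(1), collapse = " ", USE.NAMES = FALSE)
}

result <- keep_listed_words(sentence, vect)
print(result)
# [1] "in Germany"   "Italy to USA"
```

Without the anchors, the first sentence would yield "in in Germany", because the trailing "in" of "Berlin" also matches.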
Answer 3 (score: 1)
This works for shorter word lists:
library(stringr)
mywords_regex <- paste0(mywords$word, collapse = "|")
sapply(str_extract_all(mysentences$sentences, mywords_regex), paste, collapse = " ")
[1] "in Germany" "Italy to USA" "in USA UK Australia"
Answer 4 (score: 0)
Thanks, everyone.
I solved this with the following code, which was inspired by an answer that uses the intersect function:
vect <- data.frame( c("USA","UK","Germany","Australia","Italy","in","to"),stringsAsFactors = F)
sentence <- data.frame(c("I lived in Germany 2 years", "I moved from Italy to USA",
"people in USA UK and Australia speak English"),stringsAsFactors = F)
sentence[,1]=gsub("[^[:alnum:] ]", "", sentence[,1]) #remove special characters
sentence[,1]=sapply(sentence[,1], FUN = function(x){ paste(intersect(strsplit(x, "\\s")[[1]], vect[,1]), collapse=" ")})
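One thing to be aware of is that `intersect` returns unique values, so a listed word that occurs twice in a sentence is kept only once. If repeats should survive, a small variation using `%in%` may be closer to what is wanted. The sentence below is made up for illustration, and the variable names are assumptions.

```r
vect <- c("USA", "UK", "Germany", "Australia", "Italy", "in", "to")
sentence <- "I moved to Italy, and then to USA"  # hypothetical sentence

# Remove punctuation, then split on whitespace.
tokens <- strsplit(gsub("[^[:alnum:] ]", "", sentence), "\\s+")[[1]]

# intersect() de-duplicates: the second "to" is lost.
with_intersect <- paste(intersect(tokens, vect), collapse = " ")

# Logical indexing with %in% keeps every occurrence, in sentence order.
with_in <- paste(tokens[tokens %in% vect], collapse = " ")

print(with_intersect)  # "to Italy USA"
print(with_in)         # "to Italy to USA"
```

For the three sentences in the question the two variants give the same result, since no listed word is repeated within a sentence.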