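In quanteda, is there a way to select a sentence only when two words both occur in it? I have found how to tokenize a text corpus into sentences. Playing with kwic and tokens_select suggests that they apply a logical OR to the two words rather than an AND.
I can do this with stringr, but I want to make sure I am not missing anything.
Example with strings: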
library(tidyverse)
myStr <- c("soil carbon is the best",
           "biodiversity is key",
           "soil carbon is biodiversity by nature")
keyw <- c("soil", "biodiversity")
# TRUE only for sentences in which every keyword is detected
tibble(sentences = myStr,
       hit_soil_carbon_biodiversity = purrr::map_lgl(myStr, ~ all(str_detect(.x, keyw))))
Thanks for your input!
Answer 0 (score: 2)
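Yes. You can use kwic() to isolate the phrase (a sequence of tokens), then rebuild a new corpus consisting only of the selected sentences. Setting the kwic window = 1000 ensures that even very long sentences (up to 2000 + 2 tokens) are captured in full.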
library("quanteda")
# reformat the corpus as sentences
sentcorp <- corpus_reshape(data_corpus_inaugural, to = "sentences")
tail(texts(sentcorp))
# 2017-Trump.83
# "Together, we will make America strong again."
# 2017-Trump.84
# "We will make America wealthy again."
# 2017-Trump.85
# "We will make America proud again."
# 2017-Trump.86
# "We will make America safe again."
# 2017-Trump.87
# "And, yes, together, we will make America great again."
# 2017-Trump.88
# "Thank you, God bless you, and God bless America."
# illustrate the selection
kwic(sentcorp, phrase("nuclear w*"), window = 3)
# [1977-Carter.47, 18:19] elimination of all | nuclear weapons | from this Earth
# [1985-Reagan.88, 12:13] further increase of | nuclear weapons | .
# [1985-Reagan.90, 9:10] one day of | nuclear weapons | from the face
# [1985-Reagan.91, 27:28] the use of | nuclear weapons | , the other
# [1985-Reagan.96, 4:5] It would render | nuclear weapons | obsolete.
# now pipe the longer kwic results back into a corpus
newsentcorp <-
  kwic(sentcorp, phrase("nuclear w*"), window = 1000) %>%
  corpus(split_context = FALSE) %>%
  texts()
newsentcorp[-4] # because 4 is really long
# 1977-Carter.47.L18
# "And we will move this year a step toward ultimate goal - - the elimination of all nuclear weapons from this Earth."
# 1985-Reagan.88.L12
# "We are not just discussing limits on a further increase of nuclear weapons."
# 1985-Reagan.90.L9
# "We seek the total elimination one day of nuclear weapons from the face of the Earth."
# 1985-Reagan.96.L4
# "It would render nuclear weapons obsolete."
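The kwic() approach above matches an adjacent phrase. For the case in the question, where the two words may appear anywhere in the same sentence, one possible sketch is to build a document-feature matrix over the sentences and keep those rows in which every keyword has a non-zero count. This is only an illustration under some assumptions: a recent quanteda release (version 3 or later, where a corpus can be subset with a logical vector), keywords that are single literal tokens, and the hypothetical object names sents, sentdfm, keycounts, hits and bothcorp, none of which come from the original answer.

```r
library("quanteda")

# Example sentences and keywords from the question
myStr <- c("soil carbon is the best",
           "biodiversity is key",
           "soil carbon is biodiversity by nature")
keyw <- c("soil", "biodiversity")

# One document per sentence, then a document-feature matrix of token counts
sents <- corpus_reshape(corpus(myStr), to = "sentences")
sentdfm <- dfm(tokens(sents))

# Keep only the keyword columns; a sentence is a hit when every keyword
# has a non-zero count (logical AND across the keywords)
keycounts <- as.matrix(dfm_select(sentdfm, pattern = keyw, valuetype = "fixed"))
hits <- rowSums(keycounts > 0) == length(keyw)

# Subset the sentence corpus to the sentences containing all keywords
bothcorp <- sents[hits]
print(bothcorp)
```

With the three example strings, only the third sentence contains both keywords. A keyword that never occurs in the corpus simply produces no column, so no sentence can reach the required count and nothing is selected.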