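Suppose we load a full text file into R as a character vector. I am looking for code that will pull out all the text between two periods ("."), as long as the text between those two periods contains "and the" and at least one "%".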
character <- as.character("Walmart stocks remained the same. Sony reported an increase, and the percent was posted at 1.0%. And the google also remained the same. And the percent of increase for Best Buy was 2.5%.")
Looking at this short example, I would expect output along the lines of:
[1] Sony reported an increase, and the percent was posted at 1.0%.
[2] And the percent of increase for Best Buy was 2.5%.
Answer 0 (score: 1)
A quick solution:
library(magrittr)
"Walmart stocks remained the same. Sony reported an increase, and the percent was posted at 1.0%. And the google also remained the same. And the percent of increase for Best Buy was 2.5%." %>%
## split the string at the sentence boundaries
gsub("\\.\\s", "\\.\t", .) %>%
strsplit("\\t") %>% unlist() %>%
## keep only sentences that contain "and the" (irrespective of case)
grep("and the", x = ., value = TRUE, ignore.case = TRUE) %>%
## keep only the sentences that end with %.
grep("%\\.$", x = ., value = TRUE) %>%
## remove leading white spaces
gsub("^\\s?", "", x = .)
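
Note that the pipeline above keeps only sentences that end in "%.", whereas the question asks for at least one "%" anywhere in the sentence. A minimal base R sketch of that looser condition might look like the following. The helper name `extract_sentences` and its `phrase` argument are hypothetical and not part of the original answer, and the sketch assumes sentences are separated by a period followed by whitespace.

```r
# Hypothetical helper (not from the original answer); base R only.
extract_sentences <- function(text, phrase = "and the") {
  # Split on whitespace that follows a period, so decimals like "1.0%" stay intact.
  sentences <- unlist(strsplit(text, "(?<=\\.)\\s+", perl = TRUE))
  # Keep sentences that contain the phrase (case-insensitive) ...
  has_phrase <- grepl(phrase, sentences, ignore.case = TRUE)
  # ... and at least one literal "%" anywhere in the sentence.
  has_percent <- grepl("%", sentences, fixed = TRUE)
  sentences[has_phrase & has_percent]
}

txt <- "Walmart stocks remained the same. Sony reported an increase, and the percent was posted at 1.0%. And the google also remained the same. And the percent of increase for Best Buy was 2.5%."
print(extract_sentences(txt))
```

On the example string this should return the same two sentences as the pipeline above. The difference only shows up for sentences where the "%" appears somewhere other than directly before the final period.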