Question

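I have a dataset that contains unstructured text data.

From the text, I want to extract the sentences that contain any of the following words: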

education_vector <- c("university", "academy", "school", "college")

For example, from the text

I am a student at the University of Wyoming. My major is biology.

I would like to get

I am a student at the University of Wyoming.

And from the text

I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College

I would like to get

I graduated from Walla Wall Community College.

and so on.

I tried using the grep function, but it returned the wrong results.

Answer 1

Edited the answer to select the first match.

texts = c("I am a student at the University of Wyoming. My major is biology.",
"I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College",
"First, I went to the Bowdoin College. Then I went to the University of California.")

gsub(".*?([^\\.]*(university|academy|school|college)[^\\.]*).*", 
    "\\1", texts, ignore.case=TRUE)

[1] "I am a student at the University of Wyoming"   
[2] " I graduated from Walla Wall Community College"
[3] "First, I went to the Bowdoin College"

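Explanation: the leading .*? is a non-greedy match, so it consumes as little as possible before the rest of the pattern. This strips any sentences that come before the relevant one.

([^\\.]*(university|academy|school|college)[^\\.]*) matches one of the keywords together with any run of non-period characters before and after it, that is, the sentence containing the keyword. The outer parentheses capture it as group 1.

The trailing .* consumes everything after the relevant sentence.

Because the pattern covers the entire string, the replacement "\\1" swaps the whole string for just the relevant part.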
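
One thing the pattern above does not cover is a text with no keyword at all: the substitution then finds nothing to replace and returns the input unchanged. The following is a minimal sketch of one way to handle that, assuming you want NA for such texts. The helper name extract_first_sentence and the second sample text are hypothetical, not part of the original answer.

```r
education_vector <- c("university", "academy", "school", "college")

# Hypothetical helper: builds the pattern from a keyword vector and
# returns NA for texts in which no keyword occurs.
extract_first_sentence <- function(texts, keywords) {
  alternation <- paste(keywords, collapse = "|")
  pattern <- paste0(".*?([^\\.]*(", alternation, ")[^\\.]*).*")
  has_match <- grepl(alternation, texts, ignore.case = TRUE)
  out <- sub(pattern, "\\1", texts, ignore.case = TRUE)
  out[!has_match] <- NA_character_  # no keyword: do not return the input as-is
  trimws(out)                       # drop the space left after the previous period
}

texts <- c("I am a student at the University of Wyoming. My major is biology.",
           "I like numbers. Nothing else to report.")  # second text is made up
result <- extract_first_sentence(texts, education_vector)
result
```

The call to trimws also removes the leading space visible in the second element of the output above.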

Answer 2

Here is a solution using grep:

education <- c("university", "academy", "school", "college")

str1 <- "I am a student at the University of Wyoming. My major is biology."
str2 <- "I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College"
str1 <- tolower(str1) # we use tolower because "university" != "University"
str2 <- tolower(str2)

grep(paste(education, collapse = "|"), unlist(strsplit(str1, "(?<=\\.)\\s+",
                                                       perl = TRUE)),
     value = TRUE)

grep(paste(education, collapse = "|"), unlist(strsplit(str2, "(?<=\\.)\\s+",
                                                       perl = TRUE)),
     value = TRUE)
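
Note that calling tolower first means the returned sentences are also lower-cased. If the original capitalization should be preserved, a possible variant is to pass ignore.case = TRUE to grep instead. The sketch below assumes the texts are held in a character vector (the name texts is an assumption) and processes all of them at once.

```r
education <- c("university", "academy", "school", "college")

texts <- c("I am a student at the University of Wyoming. My major is biology.",
           "I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College")

keyword_pattern <- paste(education, collapse = "|")

# Split each text wherever a period is followed by whitespace;
# the lookbehind keeps the period attached to its sentence.
sentences <- strsplit(texts, "(?<=\\.)\\s+", perl = TRUE)

# For each text, keep only the sentences that mention a keyword,
# matching case-insensitively so the input does not need to be lower-cased.
matches <- lapply(sentences, function(s) {
  grep(keyword_pattern, s, ignore.case = TRUE, value = TRUE)
})
matches
```

The result is a list with one element per input text, so a text with several matching sentences keeps all of them, and a text with none yields character(0).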
