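I have a dataset containing unstructured text data.
From the text, I want to extract the sentences that contain any of the following words: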
education_vector <- c("university", "academy", "school", "college")
For example, from the text
I am a student at the University of Wyoming. My major is biology.
I want to get
I am a student at the University of Wyoming.
From the text
I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College
I want to get
I graduated from Walla Wall Community College.
And so on.
I tried using the grep function, but it returned the wrong results.
Answer 0 (score: 1)
Edited the answer to select only the first match.
texts = c("I am a student at the University of Wyoming. My major is biology.",
"I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College",
"First, I went to the Bowdoin College. Then I went to the University of California.")
gsub(".*?([^\\.]*(university|academy|school|college)[^\\.]*).*",
"\\1", texts, ignore.case=TRUE)
[1] "I am a student at the University of Wyoming"
[2] " I graduated from Walla Wall Community College"
[3] "First, I went to the Bowdoin College"
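Explanation:
.*?
is a non-greedy match, so it gives way to the rest of the pattern. It removes any sentences before the relevant one.
([^\\.]*(university|academy|school|college)[^\\.]*)
matches any run of characters other than a period before and after one of the keywords.
.*
takes care of anything after the relevant sentence.
Together, this replaces the whole string with only the relevant part.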
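The gsub call above keeps only the first matching sentence per string. If every matching sentence is wanted, one possible variation (a minimal sketch, not part of the original answer) reuses the same character class with gregexpr and regmatches. The names pattern, matches, and found are illustrative, and the sketch assumes a period only ever marks the end of a sentence.

```r
texts <- c("I am a student at the University of Wyoming. My major is biology.",
           "First, I went to the Bowdoin College. Then I went to the University of California.")

# A run of non-period characters containing one of the keywords
pattern <- "[^.]*(university|academy|school|college)[^.]*"

# gregexpr finds every match in each string; regmatches extracts them as a list
matches <- regmatches(texts, gregexpr(pattern, texts, ignore.case = TRUE, perl = TRUE))

# Strip the leading space left over from the sentence boundary
found <- lapply(matches, trimws)
found
```

The result is a list with one character vector per input string. As with the gsub version, the trailing period is not part of the match, and abbreviations such as "Dr." would split a sentence early.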
Answer 1 (score: 0)
Here is a solution using grep:
education <- c("university", "academy", "school", "college")
str1 <- "I am a student at the University of Wyoming. My major is biology."
str2 <- "I love statistics and I enjoy working with numbers. I graduated from Walla Wall Community College"
str1 <- tolower(str1) # we use tolower because "university" != "University"
str2 <- tolower(str2)
grep(paste(education, collapse = "|"),
     unlist(strsplit(str1, "(?<=\\.)\\s+", perl = TRUE)),
     value = TRUE)
grep(paste(education, collapse = "|"),
     unlist(strsplit(str2, "(?<=\\.)\\s+", perl = TRUE)),
     value = TRUE)
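Because tolower is applied first, the sentences come back in lower case. A possible variation (a sketch only; extract_sentences is a hypothetical helper name, not from the original answer) wraps the same split-then-grep steps in a function and passes ignore.case = TRUE to grep instead, so the original capitalization is preserved.

```r
# Hypothetical helper: split text into sentences, keep those containing a keyword
extract_sentences <- function(text, keywords) {
  # Split on whitespace that follows a period (same split as above)
  sentences <- unlist(strsplit(text, "(?<=\\.)\\s+", perl = TRUE))
  # Case-insensitive match against any keyword; value = TRUE returns the sentences
  grep(paste(keywords, collapse = "|"), sentences, ignore.case = TRUE, value = TRUE)
}

education <- c("university", "academy", "school", "college")

extract_sentences("I am a student at the University of Wyoming. My major is biology.",
                  education)
# [1] "I am a student at the University of Wyoming."
```

Note that the keywords are matched as substrings, so "school" would also match a word such as "preschool"; wrapping the alternation in \\b word boundaries is one way to tighten this if needed.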