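I have unstructured text, and I want to join certain words together so that the concept they express is preserved for my text-mining task. For example, in the strings below I want to change "High pressure" to "High_pressure", "not working" to "not_working", and "No air" to "No_air".
Sample text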
c(" High pressure was the main problem in the machine","the system is not working right now","No air in the system")
Word list
c('low', 'high', 'no', 'not')
Desired output
# [1] " High_pressure was the main problem in the machine"
# [2] "the system is not_working right now"
# [3] "No_air in the system"
Answer 0 (score: 2)
First, save the text input and the list of modifier words that should be joined to the word that follows them:
textIn <-
c(" High pressure was the main problem in the machine","the system is not working right now","No air in the system")
prefix <- c("high", "low", "no", "not")
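Then build a regular expression that captures any of those words followed by a space. Note that I use \b (a word boundary) so that we don't accidentally match these words when they are only the ending of a longer word, such as "low" at the end of "slow".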
# Regex: \b(high|low|no|not) followed by one space; the space is replaced by "_"
pattern <- paste0("\\b(", paste(prefix, collapse = "|"), ") ")
gsub(pattern, "\\1_", textIn, ignore.case = TRUE)
Returns
[1] " High_pressure was the main problem in the machine"
[2] "the system is not_working right now"
[3] "No_air in the system"
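To see what the \b is buying you, here is a small sketch that runs the same substitution with and without the word boundary. The input string `x` is a made-up example (it is not from the question), chosen because "slow" ends in "low"; the names `alt`, `with_b` and `without_b` are just for illustration.

```r
# Hypothetical input: "slow" ends in "low", one of the prefix words
x <- "a slow leak caused low pressure"
prefix <- c("high", "low", "no", "not")
alt <- paste(prefix, collapse = "|")

# With \b, only the standalone word "low" is joined to the next word
with_b <- gsub(paste0("\\b(", alt, ") "), "\\1_", x, ignore.case = TRUE)

# Without \b, the "low" inside "slow" is matched as well
without_b <- gsub(paste0("(", alt, ") "), "\\1_", x, ignore.case = TRUE)

print(with_b)     # "a slow leak caused low_pressure"
print(without_b)  # "a slow_leak caused low_pressure"
```

Keep in mind that the pattern only matches a single literal space after the prefix word, so text with tabs or repeated spaces would probably need to be normalized first.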