Combining words in R using regular expressions

Asked: 2016-10-31 14:07:13

Tags: r regex

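I have unstructured text, and I want to join certain words together so that the concepts they express are preserved for my text mining task. For example, in the strings below I want to change "High pressure" into "High_pressure", "not working" into "not_working", and "No air" into "No_air".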

Sample text

c(" High pressure was the main problem in the machine","the system is not working right now","No air in the system")

Word list

c('low', 'high', 'no', 'not')

Expected output

# [1] " High_pressure was the main problem in the machine"
# [2] "the system is not_working right now"               
# [3] "No_air in the system"    

1 answer:

Answer 0 (score: 2)

First, save the text input and the list of modifier words that should be joined to the word that follows them:

textIn <- 
  c(" High pressure was the main problem in the machine","the system is not working right now","No air in the system")

prefix <- c("high", "low", "no", "not")

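Then build a regular expression that captures any of those words followed by a space. Note that I use \b (a word boundary) so that we don't accidentally match one of them at the end of a longer word, e.g. the "low" in "slow".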

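gsub(
  # \\b anchors at a word boundary; the group captures any one prefix word
  paste0("\\b(", paste(prefix, collapse = "|"), ") "),
  "\\1_",  # keep the captured word and swap the trailing space for "_"
  textIn,
  ignore.case = TRUE
)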

This returns:

[1] " High_pressure was the main problem in the machine"
[2] "the system is not_working right now"          
[3] "No_air in the system"
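
To see what the \b anchor contributes, here is a minimal sketch that runs the same substitution with and without it. The strings in `extra` are made-up test inputs, not part of the question, and the variable names `alternation`, `with_boundary`, and `without_boundary` are only illustrative.

```r
prefix <- c("high", "low", "no", "not")
alternation <- paste(prefix, collapse = "|")

# Hypothetical inputs (not from the question): "slow" ends in "low",
# and "piano" ends in "no"
extra <- c("slow pressure build-up", "piano air valve", "Low pressure alarm")

# With \\b: a prefix only matches when it starts at a word boundary
with_boundary <- gsub(
  paste0("\\b(", alternation, ") "), "\\1_", extra, ignore.case = TRUE
)

# Without \\b: a prefix also matches at the tail end of a longer word
without_boundary <- gsub(
  paste0("(", alternation, ") "), "\\1_", extra, ignore.case = TRUE
)

print(with_boundary)
# [1] "slow pressure build-up" "piano air valve"        "Low_pressure alarm"
print(without_boundary)
# [1] "slow_pressure build-up" "piano_air valve"        "Low_pressure alarm"
```

Only the version with \b leaves "slow pressure" and "piano air" alone, while both versions still join "Low pressure".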