Question

I have unstructured text, and I want to combine some words so that the concept is preserved for my text mining task. For example, in the strings below I want to change "High pressure" into "High_pressure", "not working" into "not_working", and "No air" into "No_air".

Sample text

c(" High pressure was the main problem in the machine","the system is not working right now","No air in the system")

List of words

c('low', 'high', 'no', 'not')

Desired output

# [1] " High_pressure was the main problem in the machine"
# [2] "the system is not_working right now"               
# [3] "No_air in the system"

Answer 1

First, save the text input and the list of modifier words to be joined:

textIn <- 
  c(" High pressure was the main problem in the machine","the system is not working right now","No air in the system")

prefix <- c("high", "low", "no", "not")

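Then, build a regular expression that captures any of those words followed by a space. Note that I am using \b to make sure we don't accidentally capture them when they appear as the end of a longer word, e.g. "slow".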

gsub(
  paste0("\\b(", paste(prefix, collapse = "|"),") ")
  , "\\1_", textIn, ignore.case = TRUE
)

Returns

[1] " High_pressure was the main problem in the machine"
[2] "the system is not_working right now"          
[3] "No_air in the system"
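
To illustrate why the \b matters, here is a minimal sketch that wraps the same pattern in a helper and compares it with a version that omits the word boundary. The function name join_prefix and the test sentence are assumptions made for this illustration; they are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): joins each prefix word
# to the word that follows it by replacing the space with an underscore.
join_prefix <- function(text, prefix) {
  # Builds e.g. "\\b(high|low|no|not) " from the prefix vector.
  pattern <- paste0("\\b(", paste(prefix, collapse = "|"), ") ")
  gsub(pattern, "\\1_", text, ignore.case = TRUE)
}

prefix <- c("high", "low", "no", "not")
sample_text <- "a slow leak under low pressure"

# With \b, "slow" is left alone: the "low" inside it does not start at a
# word boundary, so only the standalone "low" is joined.
with_boundary <- join_prefix(sample_text, prefix)
print(with_boundary)
# [1] "a slow leak under low_pressure"

# Without \b, the "low" at the end of "slow" is matched as well.
without_boundary <- gsub("(high|low|no|not) ", "\\1_", sample_text,
                         ignore.case = TRUE)
print(without_boundary)
# [1] "a slow_leak under low_pressure"
```

The helper is only a convenience; calling gsub directly, as in the answer, does the same thing.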

Combining words in R using regular expressions
