Question

I am trying to change certain words in some strings into the tagged forms listed in train.

train = c('love/POS','happy/POS','sad/NEG','fearsome/NEG','lazy/NEG')
test = c('I love you', 'I am so happy now', 'You look sad somehow', 'the lazy boy look so fearsome')

Given these, I want to produce a result like this:

[1]'I love/POS you' 'I am so happy/POS now' 'You look sad/NEG somehow' 'the lazy/NEG boy look so fearsome/NEG'

Of course, I could do it the crude way with gsub, like this:

part1 = gsub('love', 'love/POS', test)
part2 = gsub('happy', 'happy/POS', part1)
.......

However, this approach is not efficient at all once the training list gets larger.

To do this more efficiently, I tried the following:

process1 = unlist(strsplit(test, '[[:space:]]+'))

mgsub <- function(pattern, replacement, x, ...) {
  if (length(pattern)!=length(replacement)) {
    stop("pattern and replacement do not have the same length.")
  }
  result <- x
  for (i in 1:length(pattern)) {
    result <- gsub(pattern[i], replacement[i], result, ...)
  }
  result
}

trainedtest = mgsub(process1, train, test)
trainedtest

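In fact, it does not work at all, because process1 and train have different lengths. Technically, I think I need a program that picks out the words to change into their tagged forms from the train list by comparing process1 against train.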

Is there any way to make this possible?

Answer 1

Here is a base R solution using match with nomatch = 0 (i.e. a word that is not found selects nothing; the default is NA):

v1 <- sub('/.*', '', train)
sapply(strsplit(test, ' '), function(i) {
  i[grepl(paste(v1, collapse = '|'), i)] <- train[match(i, v1, nomatch = 0)]
  paste(i, collapse = ' ')
})

#[1] "I love/POS you"    "I am so happy/POS now"  "You look sad/NEG somehow"             
#[4] "the lazy/NEG boy look so fearsome/NEG"
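
One caveat with the code above: grepl matches substrings while match compares whole words, so a word such as 'lovely' would be selected by grepl but have no counterpart from match. A minimal sketch of a variant that relies on match alone is shown below. The helper name tag_words is hypothetical, and it assumes words are separated by single spaces, as in the question's data.

```r
train <- c('love/POS', 'happy/POS', 'sad/NEG', 'fearsome/NEG', 'lazy/NEG')
test <- c('I love you', 'I am so happy now', 'You look sad somehow',
          'the lazy boy look so fearsome')

# Hypothetical helper: tag whole words only, using one match() call per sentence.
tag_words <- function(sentences, tagged) {
  bare <- sub('/.*', '', tagged)            # 'love/POS' -> 'love'
  vapply(strsplit(sentences, ' ', fixed = TRUE), function(words) {
    idx <- match(words, bare)               # NA where a word is not in the list
    hit <- !is.na(idx)
    words[hit] <- tagged[idx[hit]]          # swap in the tagged form
    paste(words, collapse = ' ')
  }, character(1))
}

tag_words(test, train)
```

Because the lookup is an exact comparison, tag_words('lovely love', train) leaves 'lovely' untouched and tags only 'love'.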

Answer 2

If you want to replace multiple patterns with the desired strings, use gsubfn:

require(gsubfn)
input = c("I love you", "I am so happy now")
toreplace<-list("love" = "love/POS", "happy" = "happy/POS")
gsubfn(paste(names(toreplace),collapse="|"),toreplace, input)
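
With a larger training list, the replacement list probably should not be typed by hand. The sketch below derives it from the question's train vector using base R only, then passes it to gsubfn in the same way as above. It assumes every entry of train has the form word/TAG. Like the code above, the pattern matches substrings, not whole words.

```r
train <- c('love/POS', 'happy/POS', 'sad/NEG', 'fearsome/NEG', 'lazy/NEG')
test <- c('I love you', 'I am so happy now', 'You look sad somehow',
          'the lazy boy look so fearsome')

# Names are the bare words, values are the tagged forms.
toreplace <- as.list(setNames(train, sub('/.*', '', train)))
pattern <- paste(names(toreplace), collapse = '|')

# gsubfn is a CRAN package; the call is skipped if it is not installed.
if (requireNamespace('gsubfn', quietly = TRUE)) {
  print(gsubfn::gsubfn(pattern, toreplace, test))
}
```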

How to change certain words into tagged form using a training list
