How to "efficiently" replace a vector of strings with another one (pairwise) in a large text corpus

Time: 2019-03-30 20:19:19

Tags: r text-mining gsub large-data

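I have a large corpus of text in a vector of strings (approx. 700,000 strings). I'm trying to replace specific words/phrases within the corpus. That is, I have a vector of approx. 40,000 phrases and a corresponding vector of replacements.

I'm looking for an efficient way of solving the problem.

I can do it in a for loop, looping through each pattern + replacement. But it scales badly (about 3 days!).

I've also tried qdap::mgsub(), but it seems to scale badly as well.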

txt <- c("this is a random sentence containing bca sk", 
"another senctence with bc a but also with zqx tt",
"this sentence contains non of the patterns", 
"this sentence contains only bc a")

patterns <- c("abc sk", "bc a", "zqx tt")

replacements <- c("@a-specfic-tag-@abc sk", 
"@a-specfic-tag-@bc a", 
"@a-specfic-tag-@zqx tt")

# either
txt2 <- qdap::mgsub(patterns, replacements, txt)
# or
for (i in seq_along(patterns)) {
    txt <- gsub(patterns[i], replacements[i], txt)
}

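Both solutions scale badly for my data: my application has about 40,000 patterns/replacements and 700,000 txt strings.

I figure there must be a more efficient way of doing this?

3 answers:

Answer 0 (score: 1)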

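If you can tokenize the texts first, then vectorized replacement is much faster. It's also faster if a) you can use a multi-threaded solution and b) you use fixed instead of regular expression matching.

Here's how to do all of that in the quanteda package. The last line pastes the tokens back into a single "document" as a character vector, if that is what you want.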

library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
quanteda_options(threads = 4)

txt <- c(
  "this is a random sentence containing bca sk",
  "another sentence with bc a but also with zqx tt",
  "this sentence contains none of the patterns",
  "this sentence contains only bc a"
)
patterns <- c("abc sk", "bc a", "zqx tt")
replacements <- c(
  "@a-specfic-tag-@abc sk",
  "@a-specfic-tag-@bc a",
  "@a-specfic-tag-@zqx tt"
)

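This tokenizes the text and then uses fast replacement of the hashed types, with fixed pattern matching (but you could have used valuetype = "regex" for regular expression matching). By wrapping patterns inside the phrase() function, you are telling tokens_replace() to look for token sequences rather than individual matches, so this solves the multi-word issue.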

toks <- tokens(txt) %>%
  tokens_replace(phrase(patterns), replacements, valuetype = "fixed")
toks
## tokens from 4 documents.
## text1 :
## [1] "this"       "is"         "a"          "random"     "sentence"  
## [6] "containing" "bca"        "sk"        
## 
## text2 :
## [1] "another"                "sentence"              
## [3] "with"                   "@a-specfic-tag-@bc a"  
## [5] "but"                    "also"                  
## [7] "with"                   "@a-specfic-tag-@zqx tt"
## 
## text3 :
## [1] "this"     "sentence" "contains" "none"     "of"       "the"     
## [7] "patterns"
## 
## text4 :
## [1] "this"                 "sentence"             "contains"            
## [4] "only"                 "@a-specfic-tag-@bc a"

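Finally, if you really want to put it back into character format, convert the tokens to a list of character vectors and paste each one together.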

sapply(as.list(toks), paste, collapse = " ")
##                                                                             text1 
##                                     "this is a random sentence containing bca sk" 
##                                                                             text2 
## "another sentence with @a-specfic-tag-@bc a but also with @a-specfic-tag-@zqx tt" 
##                                                                             text3 
##                                     "this sentence contains none of the patterns" 
##                                                                             text4 
##                                "this sentence contains only @a-specfic-tag-@bc a"

You'll have to test this on your large corpus, but 700k strings does not sound like too large a task. Please try this and report how it did!
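As a rough illustration of point b) in base R, which this answer does not otherwise use: gsub() and grepl() accept fixed = TRUE, which treats the pattern as a literal string instead of a regular expression. The helper name replace_fixed is hypothetical, and the loop still makes one pass per pattern, so this is only a sketch of the fixed-matching idea and not a replacement for the token-based approach. Note that it matches substrings and does not respect word boundaries.

```r
# Hypothetical helper: pairwise literal replacement using base R only.
replace_fixed <- function(txt, patterns, replacements) {
  stopifnot(length(patterns) == length(replacements))
  for (i in seq_along(patterns)) {
    # Find the strings that contain the pattern; skip gsub() if there are none.
    hit <- grepl(patterns[i], txt, fixed = TRUE)
    if (any(hit)) {
      txt[hit] <- gsub(patterns[i], replacements[i], txt[hit], fixed = TRUE)
    }
  }
  txt
}

txt <- c("another sentence with bc a but also with zqx tt",
         "this sentence contains none of the patterns")
patterns <- c("abc sk", "bc a", "zqx tt")
replacements <- paste0("@a-specfic-tag-@", patterns)
out <- replace_fixed(txt, patterns, replacements)
```

Because the replacements are applied one after another, a later pattern can also match inside text inserted by an earlier replacement, so the order of patterns may matter for real data.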

Answer 1 (score: 0)

Create a vector of all the words in each phrase

txt1 = strsplit(txt, " ")
words = unlist(txt1)

Use match() to find the index of the words to replace, and replace them

idx <- match(words, patterns)
words[!is.na(idx)] = replacements[idx[!is.na(idx)]]

Re-list the phrases and paste them back together

phrases = relist(words, txt1)
updt = sapply(phrases, paste, collapse = " ")

I guess this won't work if patterns can contain more than one word...
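Putting the three steps together, here is a minimal self-contained sketch. The toy txt, patterns, and replacements values are made up for illustration. It shows both the single vectorized lookup and the limitation: the multi-word pattern "bc a" is never replaced, because no single token equals it.

```r
txt <- c("alpha beta gamma", "beta delta", "bc a here")
patterns <- c("beta", "delta", "bc a")
replacements <- c("<B>", "<D>", "<BC-A>")

# Split on single spaces; the resulting list is the skeleton for relist().
txt1 <- strsplit(txt, " ", fixed = TRUE)
words <- unlist(txt1)

# match() looks up every word against the patterns in one vectorized call.
idx <- match(words, patterns)
hit <- !is.na(idx)
words[hit] <- replacements[idx[hit]]

# Restore the per-string structure and paste each string back together.
updt <- vapply(relist(words, txt1), paste, character(1), collapse = " ")
```

The third string comes back unchanged, since "bc" and "a" are looked up as separate words.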

Answer 2 (score: 0)

Create a map between the old and new values

map <- setNames(replacements, patterns)

Create a single regular expression that contains all the patterns as alternatives

pattern = paste0("(", paste0(patterns, collapse="|"), ")")

Find all the matches, and extract them

ridx <- gregexpr(pattern, txt)
m <- regmatches(txt, ridx)

Unlist, map, and relist the matches to their replacement values, and update the original vector

regmatches(txt, ridx) <- relist(map[unlist(m)], m)
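Two caveats with a single alternation: patterns that contain regex metacharacters (such as "." or "(") need escaping if they are meant literally, and overlapping patterns need a defined priority. The following is a hedged sketch of one way to handle both, assuming all patterns are literal strings. escape_regex is a hypothetical helper, the toy data is made up, and I have not timed this with 40,000 alternatives, so try it on a sample first.

```r
txt <- c("another sentence with bc a but also with zqx tt",
         "the price is 3.5 (approx)")
patterns <- c("bc a", "zqx tt", "3.5", "bc")
replacements <- c("<BC-A>", "<ZQX-TT>", "<NUM>", "<BC>")
map <- setNames(replacements, patterns)

# Hypothetical helper: escape regex metacharacters so patterns match literally.
escape_regex <- function(x) gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)

# Put longer patterns first: with perl = TRUE the first alternative that
# matches at a position wins, so "bc a" has to come before "bc".
ord <- order(nchar(patterns), decreasing = TRUE)
pattern <- paste0("(", paste(escape_regex(patterns[ord]), collapse = "|"), ")")

# Same find / map / relist steps as above, using the escaped pattern.
ridx <- gregexpr(pattern, txt, perl = TRUE)
m <- regmatches(txt, ridx)
regmatches(txt, ridx) <- relist(unname(map[unlist(m)]), m)
```

Unlike the loop in the question, each string is scanned once, and text inserted by a replacement is never matched again by another pattern.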