Question

I'm trying to create training data for the OpenNLP name finder and would appreciate any help.

If I have a text file like this:

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. 
John Smith is chairman of Elsevier N.V., the Dutch publishing group.

and a list of names in a second file, for example:

Pierre Vinken
John Smith

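Is there a way to find every mention of those names in the text file and tag them in place to create training data, so that the file then looks like this: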

<START:CEO> Pierre Vinken <END>, 61 years old, will join the board as a nonexecutive director Nov. 29. 
<START:CEO> John Smith <END> is chairman of Elsevier N.V., the Dutch publishing group.

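Note that I know other preprocessing steps are needed to make the file suitable for training, such as forcing the data into one sentence per line.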

I'd appreciate a solution in Notepad++ or R, but I also have access to shell tools if needed. Thanks!

Answer 1

# Using R (mgsub comes from the qdap package)
library(qdap)
x1 <- "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. John Smith is chairman of Elsevier N.V., the Dutch publishing group."
y1 <- c("Pierre Vinken", "John Smith")
# Build one tagged replacement per name (paste0 is vectorized over y1)
y2 <- paste0("<START:CEO> ", y1, " <END>")
mgsub(y1, y2, x1)
[1] "<START:CEO> Pierre Vinken <END>, 61 years old, will join the board as a nonexecutive director Nov. 29.<START:CEO> John Smith <END> is chairman of Elsevier N.V., the Dutch publishing group."
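
The question starts from two files rather than in-memory strings. Here is a minimal base-R sketch of the same fixed-string replacement applied to files, without qdap. The file names and the `tag_names` helper are hypothetical, and the inputs are written to temporary files only so the example runs on its own. It assumes one name per line in the names file and does not handle names that overlap each other.

```r
# Hypothetical input files, created here so the sketch is self-contained.
text_file  <- tempfile(fileext = ".txt")
names_file <- tempfile(fileext = ".txt")
out_file   <- tempfile(fileext = ".txt")
writeLines(c(
  "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.",
  "John Smith is chairman of Elsevier N.V., the Dutch publishing group."
), text_file)
writeLines(c("Pierre Vinken", "John Smith"), names_file)

# Hypothetical helper: wrap every literal occurrence of each name in tags.
tag_names <- function(lines, names, type = "CEO") {
  names <- trimws(names)
  names <- names[nzchar(names)]  # ignore blank lines in the names file
  for (nm in names) {
    # fixed = TRUE treats the name as a literal string, not a regex
    lines <- gsub(nm, paste0("<START:", type, "> ", nm, " <END>"),
                  lines, fixed = TRUE)
  }
  lines
}

tagged <- tag_names(readLines(text_file), readLines(names_file))
writeLines(tagged, out_file)
print(tagged)
```

Because the loop runs one name at a time, a short name that is part of a longer one (say "Smith" and "John Smith") would end up tagged twice, so this is only safe for non-overlapping names.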

Answer 2

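Here is an approach using qdapRegex (which I maintain). It uses base gsub with a capture group; qdapRegex isn't necessary, but I like the ease of use of group and pastex (I also show how to do this in base R below). This builds a single regex, so mgsub isn't needed. It may be a bit slower, because it uses fixed = FALSE whereas mgsub uses fixed = TRUE.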

x1<-"Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. John Smith is chairman of Elsevier N.V., the Dutch publishing group."
y1<-c("Pierre Vinken", "John Smith")

## pacman used to load and if missing install qdapRegex
if (!require("pacman")) install.packages("pacman")
pacman::p_load(qdapRegex)

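## join the names with "|" first, then wrap the whole alternation in ONE group,
## so that \\1 always holds whichever name matched
gsub(group(pastex(y1)), "<START:CEO> \\1 <END>", x1, perl=TRUE)

## [1] "<START:CEO> Pierre Vinken <END>, 61 years old, will join the board as a nonexecutive director Nov. 29. <START:CEO> John Smith <END> is chairman of Elsevier N.V., the Dutch publishing group."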

Base R

## a single capture group around all alternatives, so \\1 is never empty
gsub(sprintf("(%s)", paste(y1, collapse="|")),
    "<START:CEO> \\1 <END>", x1, perl=TRUE)

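Since the regex route uses fixed = FALSE, any regex metacharacter in a name (for example the "." in an initial) is interpreted as a pattern, and a name can also match inside a longer word. Here is a rough sketch of one way to guard against both, assuming perl = TRUE: each name is quoted with PCRE's \Q...\E, and lookarounds require that the match is not glued to other word characters. The `build_pattern` helper and the sample sentence are made up for illustration.

```r
# Hypothetical helper: one regex that matches any of the names literally.
build_pattern <- function(names) {
  # Put longer names first so they win when one name is a prefix of another
  names <- names[order(nchar(names), decreasing = TRUE)]
  # \Q...\E quotes each name, so "." and other metacharacters match literally
  alternatives <- paste0("\\Q", names, "\\E", collapse = "|")
  # Lookarounds reject matches that are part of a longer word
  paste0("(?<!\\w)(", alternatives, ")(?!\\w)")
}

# Made-up sample: "John Smithson" and "JK Smith" must NOT be tagged
x <- "John Smith met John Smithson, JK Smith and J. Smith."
y <- c("John Smith", "J. Smith")

result <- gsub(build_pattern(y), "<START:CEO> \\1 <END>", x, perl = TRUE)
print(result)
```

The lookarounds are used instead of \b because a name that ends in punctuation, such as "N.V.", has no word boundary after its final period. Quoting with \Q...\E breaks only if a name itself contains the sequence \E.
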
Replace all items from a list in a text file
