Question

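I have some text that was produced by OCR. The OCR inserted many line breaks (\n) that should not be there, and it also missed many line breaks that should be there.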

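I want to remove the existing line breaks and replace them with spaces, then replace a specific character sequence in the original text with line breaks, and finally convert the document into a corpus in quanteda.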

I can create a basic corpus. The trouble is that I cannot break it into paragraphs. If I use
corpus_reshape(corps, to = "paragraphs", use_docvars = TRUE), the document is not split.

If I use corpus_segment(corps, pattern = "\n"), I get an error.

rm(list=ls(all=TRUE))
library(quanteda)
library(readtext)

# Here is a sample Text
sample <- "Hello my name is Christ-
ina. 50 Sometimes we get some we-


irdness

Hello my name is Michael, 
sometimes we get some weird,


 and odd, results-- 50 I want to replace the 
 50s
"



# Removing the existing breaks
sample <- gsub("\n", " ", sample)
sample <- gsub(" {2,}", " ", sample)
# Adding new breaks
sample <- gsub("50", "\n", sample)

# I can create a corpus
corps <- corpus(sample, compress = FALSE)
summary(corps, 1)

# But I can't change to paragraphs
corp_para <- corpus_reshape(corps, to ="paragraphs", use_docvars = TRUE)
summary(corp_para, 1)

corp_segmented <-  corpus_segment(corps, pattern = "\n")

# The \n characters are in both documents.... 
corp_para$documents$texts
sample

Answer 1

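I suggest cleaning up your text with regex replacement before making it into a corpus. The trick with your text is to figure out where you want to remove the newlines and where you want to keep them. I am guessing from your question that you want to remove the occurrences of "50", but probably also rejoin the words that have been split by a hyphen and a newline. You may also want to keep two newlines between texts?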

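Many users prefer the simpler interface of the stringr package, but I have always tended to use stringi (on which stringr is built). It allows vectorised replacement, so you can feed it a vector of patterns to match, and a vector of replacements, in one function call.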

library("stringi")

sample2 <- stri_replace_all_regex(sample, c("\\-\\n+", "\\n+", "50"), c("", "\n", "\n"),
  vectorize_all = FALSE
)
cat(sample2)
## Hello my name is Christina. 
##  Sometimes we get some weirdness
## Hello my name is Michael, 
## sometimes we get some weird,
##  and odd, results-- 
##  I want to replace the 
##  
## s

Here, you match "\\n" as a regular expression pattern, but use just "\n" as the (literal) replacement.
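With vectorize_all = FALSE, stringi applies the pattern/replacement pairs one after another, each to the result of the previous step. To make that concrete, here is a rough base R sketch of the same three steps. It is an illustration only, not how stringi is implemented; the short `txt` string is a made-up stand-in for the sample in the question, and perl = TRUE is my choice so that "\\n" is handled by a PCRE engine.

```r
# Small stand-in text (an assumption, shorter than the sample in the question)
txt <- "Christ-\nina. 50 Some we-\n\n\nirdness\n\nNext 50s\n"

# Same three pattern/replacement pairs as in the stringi call
patterns     <- c("-\\n+", "\\n+", "50")
replacements <- c("",      "\n",   "\n")

# Apply each pair in order, feeding each result into the next step
cleaned <- txt
for (i in seq_along(patterns)) {
  cleaned <- gsub(patterns[i], replacements[i], cleaned, perl = TRUE)
}

cat(cleaned)
```

The first step rejoins hyphenated words, the second collapses runs of newlines into one, and the third turns every "50" into a newline.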

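There are two newlines before the final "s" in the result because a) the input already had a newline just before " 50s" (at the end of the line "I want to replace the "), and b) replacing 50 with a new "\n" added a second one.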
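If that almost-empty line is unwanted, one option (my assumption about what you might want, not something the question asks for) is a further pass that collapses each newline together with its surrounding whitespace into a single newline. A minimal base R sketch on a made-up fragment:

```r
# Fragment shaped like the tail of the sample text (a stand-in, not the real data)
tail_txt <- "replace the \n 50s\n"

# Replacing "50" leaves a line that contains only a space
with_breaks <- gsub("50", "\n", tail_txt, fixed = TRUE)

# Collapse any whitespace around a newline, including further newlines, into one "\n"
tidy <- gsub("\\s*\\n\\s*", "\n", with_breaks, perl = TRUE)

cat(tidy)
```

Note that this also strips leading and trailing spaces on each line, which may or may not be what you want.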

Now you can construct the corpus with quanteda::corpus(sample2).
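If you want each line to become its own document, one way that does not depend on how corpus_reshape() defines a paragraph is to split the cleaned string yourself and pass the resulting character vector to corpus(), which creates one document per element. This is a sketch under assumptions: `cleaned` is a made-up stand-in for sample2, and the requireNamespace() guard is only there so the snippet also runs where quanteda is not installed.

```r
# Stand-in for the cleaned text (hypothetical, shaped like sample2)
cleaned <- "Hello my name is Christina. \n Sometimes we get some weirdness\n \ns\n"

# Split on newlines, trim surrounding whitespace, and drop empty pieces
paras <- strsplit(cleaned, "\n", fixed = TRUE)[[1]]
paras <- trimws(paras)
paras <- paras[nzchar(paras)]

# One corpus document per remaining piece
if (requireNamespace("quanteda", quietly = TRUE)) {
  corp_para <- quanteda::corpus(paras)
  print(quanteda::ndoc(corp_para))
}
```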
