Question

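I can't figure out how to read the first two lines of each document in a corpus in R. The first two lines contain the headlines of the news articles I want to analyze. I want to search the headlines (not the rest of each text) for the word "abortion".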

Here is the code I used to create the corpus:

myCorp <- corpus(readtext(file='~/R/win-library/3.3/quanteda/Abortion/1972/*'))

I tried using readLines in a for loop:

for (mycorp in myCorp) {
titles <- readLines(mycorp, n = 2)
write.table(mycorp, "1972_text_P.txt", sep="\n\n", append=TRUE)
write.table(titles, "1972_text_P.txt", append=TRUE)
}

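Error in readLines(mycorp, n = 2) : 'con' is not a connection

I deliberately did not create a DFM, because I want to keep the 465 files as individual documents in the corpus. How can I get the headlines from the article texts? Or, ideally, how can I search only the first two lines of each document for the keyword (abortion) and create a file containing only the headlines that include it? Thanks for any help you can offer.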

Answer 1

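The readLines function requires a connection object as its argument. Because corpus() does not return a connection, you need to create a connection to each string in the corpus inside the loop.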
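As a minimal sketch of that idea in base R only: the named character vector `docs` below is a hypothetical stand-in for the article texts (it is not taken from the asker's corpus), and each string is wrapped in a `textConnection()` so that `readLines()` has a connection to read from.

```r
# Hypothetical stand-in for the article texts; in practice these strings
# would come from the corpus.
docs <- c(
  d1 = "Headline one\nSubheading one\nBody text of article one.",
  d2 = "Headline two\nSubheading two\nBody text of article two."
)

titles <- list()
for (name in names(docs)) {
  con <- textConnection(docs[[name]])      # wrap the string in a connection
  titles[[name]] <- readLines(con, n = 2)  # read only the first two lines
  close(con)                               # release the connection
}

titles$d1
```

Each element of `titles` is then a character vector holding the first two lines of the corresponding document.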


Answer 2

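I would suggest two options:

Regular expression replacement that keeps only the first 2 lines

If your first two lines contain what you need, just extract them with a regular expression that pulls out the first two lines. This is faster than looping.

@rconradin's solution works, but as you will note in the corpus documentation, we strongly discourage accessing the internals of a corpus object directly (because they are going to change soon). Not looping is also faster.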

# test corpus for demonstration
testcorp <- corpus(c(
    d1 = "This is doc1, line 1.\nDoc1, Line 2.\nLine 3 of doc1.",
    d2 = "This is doc2, line 1.\nDoc2, Line 2.\nLine 3 of doc2."
))

summary(testcorp)
## Corpus consisting of 2 documents.
## 
##  Text Types Tokens Sentences
##    d1    12     17         3
##    d2    12     17         3

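Now overwrite the texts with just the first two lines. (This also discards the second newline; if you want to keep it, just move it into the first capture group.)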

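# (?s) lets the final .* span newlines, so everything after line 2 is
# dropped even when a document has more than three lines
texts(testcorp) <- 
    stringi::stri_replace_all_regex(texts(testcorp), 
                                    "(?s)^([^\\n]*\\n[^\\n]*)(\\n).*$", "$1")
summary(testcorp)
## Corpus consisting of 2 documents.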
## 
##  Text Types Tokens Sentences
##    d1    10     12         2
##    d2    10     12         2

texts(testcorp)
##                                     d1                                     d2 
## "This is doc1, line 1.\nDoc1, Line 2." "This is doc2, line 1.\nDoc2, Line 2."

Using `corpus_segment()`:

Another solution is to use corpus_segment():

testcorp2 <- corpus_segment(testcorp, what = "other", delimiter = "\\n", 
                            valuetype = "regex")
summary(testcorp2)
## Corpus consisting of 6 documents.
## 
##  Text Types Tokens Sentences
##  d1.1     7      7         1
##  d1.2     5      5         1
##  d1.3     5      5         1
##  d2.1     7      7         1
##  d2.2     5      5         1
##  d2.3     5      5         1

# get the serial number from each docname
docvars(testcorp2, "sentenceno") <- 
    as.integer(gsub(".*\\.(\\d+)", "\\1", docnames(testcorp2)))
summary(testcorp2)
## Corpus consisting of 6 documents.
## 
##  Text Types Tokens Sentences sentenceno
##  d1.1     7      7         1          1
##  d1.2     5      5         1          2
##  d1.3     5      5         1          3
##  d2.1     7      7         1          1
##  d2.2     5      5         1          2
##  d2.3     5      5         1          3

testcorp3 <- corpus_subset(testcorp2, sentenceno <= 2)
texts(testcorp3)
##                    d1.1                    d1.2                    d2.1                    d2.2 
## "This is doc1, line 1."         "Doc1, Line 2." "This is doc2, line 1."         "Doc2, Line 2."
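Once the texts are reduced to headlines, the remaining step from the question (keeping only headlines that mention the keyword and writing them to a file) could be sketched in base R as follows. The `headlines` vector and the temporary output file are illustrative assumptions, not part of the original answers; with a real corpus the vector would be the result of `texts()` after one of the steps above.

```r
# Hypothetical headlines, one element per document (two lines each).
headlines <- c(
  d1 = "Court hears abortion case\nRuling expected in spring",
  d2 = "City council passes budget\nTaxes unchanged",
  d3 = "Senate debate\nAbortion bill stalls in committee"
)

# Case-insensitive search for the keyword anywhere in the two headline lines.
has_keyword <- grepl("abortion", headlines, ignore.case = TRUE)
matched <- headlines[has_keyword]

# Write the matching headlines to a file, labelled by document name and
# separated by blank lines. A temporary file is used here for illustration.
out_file <- tempfile(fileext = ".txt")
writeLines(paste0(names(matched), ":\n", matched), out_file, sep = "\n\n")

names(matched)
```

Because `grepl()` is vectorised over the whole character vector, no loop over documents is needed.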

