Looping through a text corpus to extract subchapters

Time: 2019-05-15 13:15:03

Tags: r dataframe

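As a continuation of my example here, I am now facing a problem: I want to extract the subchapters of every document in a document collection in R for further text mining. This is my sample data: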

doc_title <- c("Example.docx", "AnotherExample.docx")
text <- c("One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.
      1 Introduction
      He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections. 
      1.1 Futher
      The bedding was hardly able to cover it and seemed ready to slide off any moment.", "2.2 Futher Fuhter
      'What's happened to me?' he thought. It wasn't a dream. His room, a proper human room although a little too small, lay peacefully between its four familiar walls.")

doc_corpus <- data.frame(doc_title, text)

This is the function that splits the text into subchapters:

library(stringr)  # provides str_match_all()

divideInto_subchapters <- function(doc_corpus){

  corpus_text <- doc_corpus$text

  # Replace lines starting with N.N.N+ with space
  corpus_text <- gsub("\\R\\d+(?:\\.\\d+){2,}\\s+[A-Z].*\\R?", " ", corpus_text, perl=TRUE)

  # Split into IDs and Texts
  data <- str_match_all(corpus_text, "(?sm)^(\\d+(?:\\.\\d+)?\\s+[A-Z][^\r\n]*)\\R(.*?)(?=\\R\\d+(?:\\.\\d+)?\\s+[A-Z]|\\z)")

  # Get the chapter ID column
  chapter_id <- trimws(data[[1]][,2])

  # Get the text ID column
  text <- trimws(data[[1]][,3])

  # Create the target DF
  corpus <- data.frame(doc_title, chapter_id, text)

  return(corpus)
}

Now I want to loop over all elements of doc_corpus and split each plain text into subchapters. This is what I have tried so far:

subchapter_corpus <- data.frame()

for (i in 1:nrow(doc_corpus)) {
  temp_corpus <- divideInto_subchapters(doc_corpus[i])
  subchapter_corpus <- rbind(subchapter_corpus, temp_corpus)
}

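Unfortunately, this returns an empty data frame. What am I doing wrong? Any help is much appreciated. My expected output for the first row of the df looks like this: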

doc_title <- c("Example.docx")
chapter_id <- (c("1 Introduction")) 
text <- (c("He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections."))

chapter_one_df <- data.frame(doc_title, chapter_id, text)

1 Answer:

Answer 0 (score: 1)

So, for me, the loop gave "subscript out of bounds" until I changed doc_corpus[i] to doc_corpus[i, ]. With that change, I did get one row in the resulting data frame.
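To illustrate why the comma matters, here is a minimal sketch on a toy data frame (df, by_column and by_row are made-up names, not part of the question). A single index on a data frame selects a column, so the text column is gone by the time the function looks for it:

```r
# Toy stand-in for doc_corpus: two rows, two columns (hypothetical data).
df <- data.frame(
  doc_title = c("Example.docx", "AnotherExample.docx"),
  text      = c("first text", "second text"),
  stringsAsFactors = FALSE
)

by_column <- df[1]    # single index: selects COLUMN 1 and keeps both rows
by_row    <- df[1, ]  # index followed by a comma: selects ROW 1 and keeps both columns

names(by_column)         # only the doc_title column survives
is.null(by_column$text)  # the text column is missing, so there is nothing to match
by_row$text              # the text of the first document
```

With df[1] the function receives no text at all, which would explain why indexing the match result with [[1]] fails.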

However, it only contains the chapter_id "2.2 Futher Fuhter". It seems to be missing "1.1 Futher".

If this is a problem with the regex, then man, it would sure help if you commented what you are doing! :)
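One possible explanation, offered as a guess rather than a confirmed diagnosis: in the sample data the heading lines of the first document are indented, while the pattern expects the chapter number directly at the start of a line. The sketch below is a variant under that assumption. The name split_subchapters is made up, the sample texts are shortened versions of the ones in the question, and the only changes to the regexes are the added [ \t]* parts. It also takes the title from the row it is given instead of the global doc_title vector, which is my assumption about what was intended.

```r
library(stringr)

# Hypothetical variant of divideInto_subchapters() that handles ONE row of doc_corpus.
split_subchapters <- function(doc_row) {
  txt <- as.character(doc_row$text)

  # Replace heading lines numbered N.N.N or deeper with a space (as in the question),
  # now also allowing leading spaces or tabs before the number.
  txt <- gsub("\\R[ \\t]*\\d+(?:\\.\\d+){2,}\\s+[A-Z].*\\R?", " ", txt, perl = TRUE)

  # Group 1: heading line ("N Title" or "N.N Title"), optionally indented.
  # Group 2: everything up to the next such heading or the end of the text.
  pattern <- paste0(
    "(?sm)^[ \\t]*(\\d+(?:\\.\\d+)?\\s+[A-Z][^\\r\\n]*)\\R",
    "(.*?)(?=\\R[ \\t]*\\d+(?:\\.\\d+)?\\s+[A-Z]|\\z)"
  )
  m <- str_match_all(txt, pattern)[[1]]

  # One output row per matched subchapter; zero rows if nothing matched.
  data.frame(
    doc_title  = rep(as.character(doc_row$doc_title), nrow(m)),
    chapter_id = trimws(m[, 2]),
    text       = trimws(m[, 3]),
    stringsAsFactors = FALSE
  )
}

# Shortened sample data with the same layout as in the question.
doc_corpus <- data.frame(
  doc_title = c("Example.docx", "AnotherExample.docx"),
  text = c(
    paste0("One morning, Gregor Samsa woke from troubled dreams.\n",
           "      1 Introduction\n",
           "      He lay on his armour-like back.\n",
           "      1.1 Futher\n",
           "      The bedding was hardly able to cover it."),
    "2.2 Futher Fuhter\n      'What's happened to me?' he thought."
  ),
  stringsAsFactors = FALSE
)

# Process row by row (note the comma in doc_corpus[i, ]) and stack the results.
pieces <- lapply(seq_len(nrow(doc_corpus)),
                 function(i) split_subchapters(doc_corpus[i, ]))
subchapter_corpus <- do.call(rbind, pieces)
```

On this shortened sample the result should contain the chapter ids "1 Introduction", "1.1 Futher" and "2.2 Futher Fuhter". Text that appears before the first heading of a document is dropped, the same as with the original pattern.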

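Feel free to comment and I will revise my answer as needed until it helps. Not sure how this usually goes, but this is only my third day answering SO questions.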