Extracting chapters from a .docx file

Time: 2017-02-15 10:41:19

Tags: indexing docx extraction

I want to extract the content of a .docx file chapter-wise. My .docx document has a numbered outline (a register), and each chapter has some content:

 1. Intro
   some text about Intro, these things, those things
 2. Special information
   these information are really special
    2.1 General information about the environment
      environment should be also important
    2.2 Further information 
      and so on and so on

So in the end it would be great to get an Nx3 matrix containing the index number, the index name and at least the content.

i_number     i_name                 content
1            Intro                  some text about Intro, these things, those things
2            Special Information    these information are really special
... 

Thanks for your help.

1 answer:

Answer 0: (score: 0)

You can export the .docx to a .txt file (or copy and paste its text into one) and apply this R script:

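library(stringr)
library(readr)

# Read the whole exported text file into a single string
doc <- read_file("filename.txt")

# A heading is a number followed by a dot, then 4 to 100 characters, then a line break.
# "\\r?\\n" accepts both Windows (CRLF) and Unix (LF) line endings.
pattern_chapter <- regex("(\\d+\\.)(.{4,100}?)(?:\\r?\\n)", dotall = TRUE)

# Column 1 holds the full match (number plus title); trim the trailing line break
i_name <- str_trim(str_match_all(doc, pattern_chapter)[[1]][, 1])
# The text between headings is the chapter content
paragraphs <- str_trim(str_split(doc, pattern_chapter)[[1]])
# Logical indexing instead of -which(): -which() returns nothing when no piece is empty
content <- paragraphs[paragraphs != ""]

result <- data.frame(i_name, content)
result$i_number <- seq.int(nrow(result))

View(result)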

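It does not work if your document contains lines that start with a number but are not headings (for example, footnotes or numbered lists).

(Please don't downvote blindly: this script works perfectly with the example given.)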
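
As a possible variation, here is a minimal base-R sketch that works line by line and returns the three columns the question asks for (i_number, i_name, content), including nested numbers such as 2.1. The function name extract_chapters, the heading pattern and the inline sample text are assumptions made for illustration; they are not part of the original script. It assumes every heading sits on its own line and starts with its number.

```r
# Hypothetical sample text standing in for the exported .txt file
doc <- paste(
  " 1. Intro",
  "   some text about Intro, these things, those things",
  " 2. Special information",
  "   these information are really special",
  "    2.1 General information about the environment",
  "      environment should be also important",
  "    2.2 Further information ",
  "      and so on and so on",
  sep = "\n"
)

# Hypothetical helper: split text into chapters by numbered heading lines
extract_chapters <- function(text) {
  # Split on LF or CRLF, trim each line and drop empty lines
  lines <- trimws(strsplit(text, "\r?\n")[[1]])
  lines <- lines[lines != ""]

  # Assumed heading shape: a (possibly nested) number such as "2." or "2.1",
  # then whitespace, then the title
  heading_pattern <- "^([0-9]+(\\.[0-9]+)*)\\.?[[:space:]]+(.+)$"
  is_heading <- grepl(heading_pattern, lines)

  # Each line belongs to the most recent heading above it;
  # lines before the first heading fall into group 0 and are ignored
  group <- cumsum(is_heading)
  headings <- lines[is_heading]
  is_body <- !is_heading & group > 0
  body <- lines[is_body]
  body_group <- group[is_body]

  # Join all body lines of a chapter into one string
  content <- vapply(
    seq_along(headings),
    function(i) paste(body[body_group == i], collapse = " "),
    character(1)
  )

  data.frame(
    i_number = sub(heading_pattern, "\\1", headings),
    i_name = sub(heading_pattern, "\\3", headings),
    content = content,
    stringsAsFactors = FALSE
  )
}

result <- extract_chapters(doc)
print(result)
```

This sketch shares the limitation mentioned above: any body line that begins with a number followed by a space (a numbered list item, a footnote, a sentence starting with a year) would still be read as a heading.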