I want to extract the content of a .docx file chapter by chapter.
My .docx document has an index, and each chapter has some content:
1. Intro
some text about Intro, these things, those things
2. Special information
these information are really special
2.1 General information about the environment
environment should be also important
2.2 Further information
and so on and so on
So in the end it would be great to get an N x 3 matrix containing the index number, the index name, and at least the content:
i_number  i_name               content
1         Intro                some text about Intro, these things, those things
2         Special Information  these information are really special
...
Thanks for your help.
Answer 0 (score: 0)
You can export the .docx to a .txt file (or copy and paste its text into one) and apply this R script:
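library(stringr)
library(readr)

# Read the whole exported text file into a single string
doc <- read_file("filename.txt")
# A heading: digits and a dot, then 4 to 100 characters up to the end of the line.
# \r?\n accepts both Windows (CRLF) and Unix (LF) line endings.
pattern_chapter <- regex("(\\d+\\.)(.{4,100}?)(?:\\r?\\n)", dotall = TRUE)
# Full heading text, e.g. "2.1 General information about the environment"
i_name <- str_trim(str_match_all(doc, pattern_chapter)[[1]][, 1])
# Split at the headings; the first piece is whatever precedes the first heading, so drop it
paragraphs <- str_split(doc, pattern_chapter)[[1]]
content <- str_trim(paragraphs[-1])

result <- data.frame(i_name, content)
result$i_number <- seq.int(nrow(result))  # running index, not the printed chapter number
View(result)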
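It does not work if your document contains lines that start with a number but are not headings (for example, footnotes or numbered lists).
(Please don't downvote blindly: this script works perfectly with the example given.)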
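If you also want the printed chapter number (such as 2.1) and the title in separate columns, a line-based variant might be worth a try. The following is only a minimal sketch in base R, not a tested solution for real documents. It assumes every heading starts with a number like "1." or "2.1" followed by a space, and the names heading_re, is_heading and chapter_id are made up for illustration. The sample text is inlined so the snippet runs on its own; for a real file you would read the exported .txt instead.

```r
# Sample text mirroring the example document from the question
doc <- paste(
  "1. Intro",
  "some text about Intro, these things, those things",
  "2. Special information",
  "these information are really special",
  "2.1 General information about the environment",
  "environment should be also important",
  "2.2 Further information",
  "and so on and so on",
  sep = "\n"
)

# Work line by line; accept both CRLF and LF line endings
lines <- strsplit(doc, "\r?\n")[[1]]

# Assumed heading shape: a number such as "1." or "2.1", spaces, then the title
heading_re <- "^([0-9]+(\\.[0-9]+)*)\\.? +(.+)$"
is_heading <- grepl(heading_re, lines)

# Every line belongs to the most recent heading above it
chapter_id <- cumsum(is_heading)

headings <- lines[is_heading]
result <- data.frame(
  i_number = sub(heading_re, "\\1", headings),  # e.g. "2.1"
  i_name   = sub(heading_re, "\\3", headings),  # e.g. "Further information"
  content  = vapply(
    seq_along(headings),
    function(k) paste(lines[chapter_id == k & !is_heading], collapse = "\n"),
    character(1)
  ),
  stringsAsFactors = FALSE
)
print(result)
```

The same limitation applies here: a content line that begins with a number followed by a space (a numbered list item, say) would be mistaken for a heading, so the heading pattern would need tightening for such documents.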