Question

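My goal is to use the tm package (library(tm)) on a very large Word document. The document has sensible typography, so each main section has an h1 heading with some h2 and h3 subheadings. I want to compare and text-mine each section, meaning the text below each h1. The subheadings are unimportant, so they can be included or excluded.

My strategy is to export the Word document to HTML and then use the rvest package to extract the paragraphs.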

library(rvest)
# the file has latin-1 chars
#Sys.setlocale(category="LC_ALL", locale="da_DK.UTF-8")
# small example html file
file <- rvest::html("https://83ae1009d5b31624828197160f04b932625a6af5.googledrive.com/host/0B9YtZi1ZH4VlaVVCTGlwV3ZqcWM/tidy.html", encoding = 'utf-8')

nodes <- file %>%
  rvest::html_nodes("h1>p") %>%
  rvest::html_text()

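I can extract every <p> with html_nodes("p"), but that is just one big soup. I need to analyse each h1 section separately.

Ideally I would end up with a list that has one element per h1 heading, each holding a vector of that section's p tags. Perhaps a loop such as for (i in 1:length(html_nodes(fil, "h1"))) (html_children(html_nodes(fil, "h1")[i])) (which does not work).

Bonus points if there is a way to tidy the Word HTML from within rvest.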

Answer 1

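Note that > is the child combinator; the selector you currently have looks for p elements that are children of an h1, which does not make sense in HTML and so returns nothing.

If you inspect the generated markup, at least in the example document you provided, you will notice that every h1 element (as well as the heading for the table of contents, which is marked up as a p instead) has an associated parent div: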
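To see the difference on a small scale, here is a minimal sketch. The inline HTML string is made up for illustration (it is not the asker's file), and the sketch assumes a version of rvest that provides read_html().

```r
library(rvest)

# Hypothetical markup, invented for this illustration only.
doc <- read_html("<body><div><h1>Title</h1><p>One</p><p>Two</p></div></body>")

# "h1 > p" asks for p elements that are direct children of an h1.
# No p is nested inside the h1 here, so nothing matches.
child_of_h1 <- html_nodes(doc, "h1 > p")

# "div > p" asks for p elements that are direct children of a div.
child_of_div <- html_nodes(doc, "div > p")

print(length(child_of_h1))
print(html_text(child_of_div))
```

The first selector should come back empty, while the second should return both paragraphs, because the paragraphs sit next to the h1 inside the div rather than inside the h1.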

<body lang="EN-US">
  <div class="WordSection1">
    <p class="MsoTocHeading"><span lang="DA" class='c1'>Indholdsfortegnelse</span></p>
    ...
  </div><span lang="DA" class='c5'><br clear="all" class='c4'></span>
  <div class="WordSection2">
    <h1><a name="_Toc285441761"><span lang="DA">Interview med Jakob skoleleder på a_skolen</span></a></h1>
    ...
  </div><span lang="DA" class='c5'><br clear="all" class='c4'></span>
  <div class="WordSection3">
    <h1><a name="_Toc285441762"><span lang="DA">Interviewet med Andreas skoleleder på b_skolen</span></a></h1>
    ...
  </div>
</body>

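All of the p elements in each section denoted by an h1 are found in that section's parent div. With this in mind, you could simply select the p elements that are siblings of each h1. However, since rvest does not currently have a way to select siblings from a context node (html_nodes() only supports looking at a node's subtree, i.e. its descendants), you will need to do this another way.

Assuming HTML Tidy creates a structure where every h1 is in a div that sits directly within body, you can grab every div except the table of contents using the following selector: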

sections <- html_nodes(file, "body > div ~ div")

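In your example document, this should result in div.WordSection2 and div.WordSection3. The table of contents is represented by div.WordSection1, which is excluded from the selection.

Then extract the paragraphs from each div: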
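The effect of the sibling combinator can be checked on a tiny stand-in document. This is only a sketch: the markup below is hypothetical, loosely modelled on the Word export above, and it assumes a version of rvest that provides read_html().

```r
library(rvest)

# Hypothetical markup mimicking the Word export: three sibling divs under body.
doc <- read_html(paste0(
  "<body>",
  "<div class='WordSection1'><p>Contents</p></div>",
  "<div class='WordSection2'><h1>A</h1><p>a1</p></div>",
  "<div class='WordSection3'><h1>B</h1><p>b1</p></div>",
  "</body>"
))

# "div ~ div" matches a div that has an earlier sibling div,
# so the first div (the table of contents) is never selected.
sections <- html_nodes(doc, "body > div ~ div")
section_classes <- html_attr(sections, "class")
print(section_classes)
```

Note that this relies on the table of contents being the first div under body; if the export places anything else first, the selector would need adjusting.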

for (section in sections) {
  paras <- html_nodes(section, "p")
  # Do stuff with paragraphs in each section...
  print(length(paras))
}
# [1] 9
# [1] 8

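As you can see, length(paras) corresponds to the number of p elements in each div. Note that some of them contain nothing but a non-breaking space (&nbsp;), which may be troublesome depending on your needs. I will leave dealing with those outliers as an exercise for the reader.

Unfortunately, no bonus points for me, as rvest does not provide its own HTML Tidy functionality. You will need to process your Word documents separately.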
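One possible way to get the list the question asks for (one element per h1, holding that section's paragraph text) while dropping the blank paragraphs is sketched below. It is an illustration rather than part of the original answer: the markup is hypothetical, the name paras_by_section is invented here, and it assumes a version of rvest that provides read_html().

```r
library(rvest)

# Hypothetical markup; &nbsp; stands in for the blank paragraphs Word emits.
doc <- read_html(paste0(
  "<body>",
  "<div><p>Contents</p></div>",
  "<div><h1>First</h1><p>a1</p><p>&nbsp;</p><p>a2</p></div>",
  "<div><h1>Second</h1><p>b1</p></div>",
  "</body>"
))

sections <- html_nodes(doc, "body > div ~ div")

# One character vector of paragraph text per section.
paras_by_section <- lapply(seq_along(sections), function(i) {
  txt <- html_text(html_nodes(sections[[i]], "p"))
  # Turn non-breaking spaces into ordinary spaces, then drop blank paragraphs.
  cleaned <- trimws(gsub("\u00a0", " ", txt, fixed = TRUE))
  cleaned[nzchar(cleaned)]
})

# Name each element after the text of its section's first h1.
names(paras_by_section) <- vapply(seq_along(sections), function(i) {
  html_text(html_nodes(sections[[i]], "h1"))[1]
}, character(1))

str(paras_by_section)
```

Each element of the resulting list can then be handed to tm separately, for example by building one corpus per section.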

Scrape all the child paragraphs under a heading (preferably with rvest)
