Question

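I have had great success using rvest to read tables and standard HTML. I have now run into a problem reading text that is split into several quoted strings. When one quoted string is followed by whitespace and then a new line of quoted text, rvest seems to add a new letter (a-z).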

Here is a reproducible example.

library(rvest)
read_html("https://www.lds.org/scriptures/ot/gen/1?lang=eng") %>% 
  html_node("#p3") %>%
  html_text()

The result is

"3 And God asaid, Let there be blight: and there was light."

The incorrect words are "asaid" and "blight". Lol.

For reference, this is the HTML structure as shown in the web inspector.

<p class="verse" id="p3">
<span class="verse-number verse">3</span>
"And God "
"said"
", Let there be "
"light"
": and there was light."
</p>

What is the solution for text that is formatted irregularly like this?

Answer 1

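If you find and click "Show Footnotes" and then inspect the page again, you will see the problem. The extra letter "a" in "asaid" and the "b" in "blight" are the text of hidden footnote markers, which are wrapped in sup tags.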

page <- read_html("https://www.lds.org/scriptures/ot/gen/1?lang=eng")
page %>% 
  html_nodes(xpath = "//p[@id = 'p3']") %>% 
  html_structure()

[[1]]
<p#p3 .verse [data-aid]>
  <span.verse-number.verse>
    {text}
  {text}
  <a.footnote.study-note-ref [href, rel]>
    <sup.studyNoteMarker.dontHighlight>
      {text}
    {text}
  {text}
  <a.footnote.study-note-ref [href, rel]>
    <sup.studyNoteMarker.dontHighlight>
      {text}
    {text}
  {text}
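
To confirm which text the sup nodes contribute, it can help to print their contents separately. The following is a minimal sketch that runs offline: `snippet` is a hypothetical stand-in for the verse markup, reconstructed from the structure above, and the real page may differ in its attributes and whitespace.

```r
library(rvest)

# Hypothetical stand-in for the verse markup (an assumption, not the live page).
snippet <- paste0(
  '<p class="verse" id="p3"><span class="verse-number verse">3</span> And God ',
  '<a class="footnote study-note-ref" href="#">',
  '<sup class="studyNoteMarker dontHighlight">a</sup>said</a>, Let there be ',
  '<a class="footnote study-note-ref" href="#">',
  '<sup class="studyNoteMarker dontHighlight">b</sup>light</a>',
  ': and there was light.</p>'
)
doc <- read_html(snippet)

# Text of the footnote markers only.
markers <- doc %>% html_nodes("#p3 sup") %>% html_text()
markers

# html_text() on the paragraph concatenates every descendant text node,
# including the markers, which is where "asaid" and "blight" come from.
full_text <- doc %>% html_node("#p3") %>% html_text()
full_text
```

Here `markers` holds the letters "a" and "b", and `full_text` reproduces the string from the question.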

So one solution (it is a bit messy) is to extract the sup nodes and then remove them from the document.

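library(xml2)  # provides xml_remove()

# Collect the <sup> footnote markers inside the verse.
footnotes <- page %>% 
  html_nodes(xpath = "//p[@id = 'p3']//sup")

# xml_remove() modifies `page` in place, so the markers are gone from here on.
xml_remove(footnotes)
page %>% 
  html_nodes(xpath = "//p[@id = 'p3']") %>% 
  html_text()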

[1] "3 And God said, Let there be light: and there was light."
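
If you would rather not modify the parsed page, a possible alternative is to select only the text nodes that are not inside a sup element and paste them together. This is a sketch under assumptions rather than part of the answer above: `snippet` is a hypothetical stand-in for the verse markup so that the example runs offline, and the XPath relies on the footnote letters being the only text inside sup tags.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for the verse markup (an assumption, not the live page).
snippet <- paste0(
  '<p class="verse" id="p3"><span class="verse-number verse">3</span> And God ',
  '<a class="footnote study-note-ref" href="#">',
  '<sup class="studyNoteMarker dontHighlight">a</sup>said</a>, Let there be ',
  '<a class="footnote study-note-ref" href="#">',
  '<sup class="studyNoteMarker dontHighlight">b</sup>light</a>',
  ': and there was light.</p>'
)
doc <- read_html(snippet)

# Every text node under #p3 that has no <sup> ancestor.
pieces <- xml_find_all(doc, "//p[@id = 'p3']//text()[not(ancestor::sup)]")

# Join the remaining text nodes in document order.
verse <- paste(xml_text(pieces), collapse = "")
verse

# The document itself is untouched, so the markers are still available.
untouched <- doc %>% html_node("#p3") %>% html_text()
```

Because nothing is removed, the same `doc` can still be used to read the footnote markers later, at the cost of a slightly longer XPath expression.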

Reading odd quoted text with rvest
