Creating a data.frame when multiple <p> nodes (sibling nodes) sit under a parent <div>

Posted: 2017-08-23 17:58:33

Tags: r

On the page below, several paragraphs sit under a single <div> element, so when I scrape the page only the first <p> is printed.

http://www.epilepsy.com/connect/forums/living-epilepsy-adults/anyone-else-w-connection-vietnam-war

I tried to pick up all the <p> elements with the code below:

content = html_text(html_node(h, 'div.field-item.even > p'))

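But once I extract all the <p> elements, I can no longer store them in the data frame (the error says "replacement has 6 rows, data has 1").

Does anyone know how to solve this? Thanks for your help.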

dataf <- data.frame(title=c(), content=c())
dataf
post.num <- 1

for(link in article_href){
    link = sprintf('http://www.epilepsy.com%s', link)
    print(link)
    h = read_html(link)

    title = html_text(html_node(h, 'div.panel-pane.pane-node-title.no-title.block'))
    title <- str_trim(title)
    str_replace_all(title, '[[:space:]]', '')
    print(title)


    content = html_text(html_node(h, 'div.field-item.even > p'))
    print(content)
    dataf[post.num, 'content'] = content


    post.num <- post.num + 1

  }

1 Answer:

Answer 0: (score: 1)

If you don't care about the formatting inside the text content, you can simply select the parent node of the <p> elements:

library(dplyr)
library(rvest)
library(stringr)

h = read_html('http://www.epilepsy.com/connect/forums/living-epilepsy-adults/anyone-else-w-connection-vietnam-war')

title <- h %>% html_node('div.pane-node-title h2') %>% html_text(trim = TRUE)
print(title)

content <- h %>% html_node('.field-name-field-body') %>% html_text(trim = TRUE)
print(content)
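
If you do want to keep the paragraphs separate before storing them, one possible approach is sketched below. The error in the question is what you would expect when a character vector with one element per <p> is assigned to a single data frame cell; collapsing the vector into one string first avoids it. This is only a sketch under assumptions: the inline HTML is a made-up stand-in for the forum pages (not the real epilepsy.com markup), and `extract_post` is a hypothetical helper name, not part of rvest.

library(rvest)

# Made-up stand-ins for two forum post pages (assumed markup, for illustration only).
pages <- list(
  read_html('<div class="pane-node-title"><h2> Post A </h2></div>
             <div class="field-item even"><p>First paragraph.</p><p>Second paragraph.</p></div>'),
  read_html('<div class="pane-node-title"><h2> Post B </h2></div>
             <div class="field-item even"><p>Only paragraph.</p></div>')
)

# Hypothetical helper: returns a one-row data frame for one parsed page.
extract_post <- function(page) {
  title <- page %>% html_node('div.pane-node-title h2') %>% html_text(trim = TRUE)

  # html_nodes() returns every matching <p>; html_node() would return only the first.
  paragraphs <- page %>% html_nodes('div.field-item.even > p') %>% html_text(trim = TRUE)

  # Collapse the paragraphs into a single string so they fit into one cell.
  content <- paste(paragraphs, collapse = "\n")

  data.frame(title = title, content = content, stringsAsFactors = FALSE)
}

# Build one row per page, then bind the rows together.
dataf <- do.call(rbind, lapply(pages, extract_post))
print(dataf)

Building one row per page and binding them at the end also sidesteps growing an empty data frame by index inside the loop. In the original loop, `pages` would correspond to calling read_html() on each link in article_href.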