I'm not familiar with web-scraping applications in R. I've built an lapply over read_html to scrape multiple pages that all share the same page structure. I'd like to save each page's raw .xml (or .html) data and reload it later, but saving a page in its raw form and reading it back in has turned out to be harder than expected. Below is my (failed) attempt to save a single page.
More generally, I'd like a systematic way to store every page I scrape so that I can reload them all later and analyze them with an apply-family function. Is that possible?
Here is some code:
rm(list = ls(all = TRUE))
library(rvest)
library(dplyr)
library(jsonlite)
library(pander)
library(stringr)
library(purrr)
library(xml2)
library(XML)

# Build one URL per results page (pages 1-5 share the same structure).
# If read_html chokes on the space in the query string, wrap the URLs in URLencode().
url_test <- "http://www.winemag.com/?s=washington merlot&drink_type=wine&page="
pages <- 1:5
test.url.list <- paste0(url_test, pages)
test.webpages <- lapply(test.url.list, read_html)
# Store the (long) CSS selector for a result title once, for reuse:
title_sel <- 'body > div.off-canvas-wrap > div > section > section > div:nth-child(2) > div.large-9.columns > div.results-section.reviews.center-column > div.results > ul > li:nth-child(2) > a > div.title'
# This produces some data from the first page:
html_text(html_node(test.webpages[[1]], title_sel))
# This gets the same field from all pages:
sapply(test.webpages, function(pg) html_text(html_node(pg, title_sel)))
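As an aside, since purrr is already loaded, the all-pages extraction can also be written with map_chr, which guarantees a character vector comes back:

map_chr(test.webpages, ~ html_text(html_node(.x, title_sel)))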
# Now I try to save one of the pages:
write_xml(test.webpages[[1]], "tester2.xml", options = "as_xml")
test.reload <- xmlParse("tester2.xml")
# When I print it, I get data back... but it comes back as an XML-package
# document, not the two-item (node + doc) list that the xml2 pages are.
test.reload
# So I can't extract the data from it:
html_text(html_node(test.reload, title_sel))
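Here is a sketch of the round trip I believe I need, assuming xml2's write_html()/read_html() pair is the right save/reload mechanism (the scraped_pages directory and file names are placeholders I made up):

# Save every scraped page to its own .html file:
dir.create("scraped_pages", showWarnings = FALSE)
paths <- file.path("scraped_pages", paste0("page_", pages, ".html"))
Map(write_html, test.webpages, paths)

# Later, reload them all as xml2 documents that rvest understands:
reloaded.webpages <- lapply(paths, read_html)

# The reloaded list should behave exactly like test.webpages:
sapply(reloaded.webpages, function(pg) html_text(html_node(pg, title_sel)))

If that round trip works, the same apply-style analysis could run on reloaded.webpages without re-scraping anything.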