I am trying to scrape this website.
I am using the same code that I use to scrape other pages:
url_dv1 <- "https://ec.europa.eu/commission/presscorner/detail/en/qanda_20_171?fbclid=IwAR2GqXLmkKRkWPoy3-QDwH9DzJiexFJ4Sp2ZoWGbfmOR1Yv8POdlLukLRaU"
url_dv1 <- paste(html_text(html_nodes(read_html(url_dv1), "#inline-nav-1 .ecl-paragraph")), collapse = "")
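But the code does not seem to work for this website. I get the following error: Error in UseMethod("read_xml") : no applicable method for 'read_xml' applied to an object of class "c('xml_document', 'xml_node')".
Why does this happen, and how can I fix it?
Thank you very much!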
Answer 0 (score: 1)
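The problem is that the page is rendered dynamically. You can work around this with PhantomJS (available for download at https://phantomjs.org/download.html). You also need a custom JavaScript script (see below). The R code below worked for me.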
library(tidyverse)
library(rvest)
dir_js <- "path/to/a/directory" # directory that holds the PhantomJS script shown below; the file must be named javascript.js
url <- "https://ec.europa.eu/commission/presscorner/detail/en/qanda_20_171?fbclid=IwAR2GqXLmkKRkWPoy3-QDwH9DzJiexFJ4Sp2ZoWGbfmOR1Yv8POdlLukLRaU"
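# run PhantomJS on the script, passing the URL as its argument
system2("path/to/where/you/have/phantomjs.exe", # path to the PhantomJS executable
        args = c(file.path(dir_js, "javascript.js"), url))
# parse the HTML file written by the script and extract the paragraphs
read_html("myhtml.html") %>%
  html_nodes("#inline-nav-1 .ecl-paragraph") %>%
  html_text()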
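To illustrate why the rendering step matters, here is a minimal offline sketch. It is not part of the original answer: `extract_text()` is a hypothetical helper, and both HTML snippets are invented stand-ins (one for markup whose container has not yet been filled by JavaScript, one for the kind of file PhantomJS would write). The real page's markup may differ.

```r
library(rvest)

# Hypothetical helper (not part of the original answer): return the text of
# every node in a saved HTML file that matches a CSS selector.
extract_text <- function(html_file, css) {
  if (!file.exists(html_file)) {
    stop("HTML file not found: ", html_file)
  }
  html_text(html_nodes(read_html(html_file), css), trim = TRUE)
}

css <- "#inline-nav-1 .ecl-paragraph"

# Invented stand-in for the raw response: the container is still empty
# because no JavaScript has run yet.
raw_file <- tempfile(fileext = ".html")
writeLines('<html><body><div id="inline-nav-1"></div></body></html>', raw_file)

# Invented stand-in for the file PhantomJS writes after rendering.
rendered_file <- tempfile(fileext = ".html")
writeLines(paste0(
  '<html><body><div id="inline-nav-1">',
  '<p class="ecl-paragraph">First paragraph.</p>',
  '<p class="ecl-paragraph">Second paragraph.</p>',
  '</div></body></html>'
), rendered_file)

raw_text <- extract_text(raw_file, css)            # no matching nodes
rendered_text <- extract_text(rendered_file, css)  # two matching nodes

length(raw_text)
paste(rendered_text, collapse = " ")
```

If the real `myhtml.html` gives an empty result, checking the value returned by `system2()` is a reasonable first step: when output is not captured, it is the command's exit status, with 0 meaning success.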
# This is the JavaScript code; save it as javascript.js in the directory assigned to dir_js
// create a webpage object
var page = require('webpage').create(),
    system = require('system');
// the URL to fetch, passed as the first command-line argument
var country = system.args[1];
// include the File System module for writing to files
var fs = require('fs');
// specify source and path to output file
// we'll just overwrite iteratively to a page in the same directory
var path = 'myhtml.html';