Question

I am trying to collect some data from OTC Markets (within the bounds of their robots.txt), but I cannot connect to the web page.

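The first thing I tried was simply scraping the HTML from the page, but the page needs JavaScript to load.
So I downloaded PhantomJS and connected that way. However, this returns a 404 error page.
I then changed the user agent to one that looks like a regular browser's to see whether that would let me connect, but still no luck. What is going on here?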

Here is a reproducible version of my code; any help would be greatly appreciated. PhantomJS can be downloaded here: http://phantomjs.org/

library(rvest)
library(xml2)
library(V8)
# example website; I have no affiliation with this stock
url <- 'https://www.otcmarkets.com/stock/YTROF/profile' 

# create javascript file that phantomjs can process
writeLines(sprintf("var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';
page.open('%s', function () {
    console.log(page.content); //page source
    phantom.exit();
});", url), con="scrape.js")

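# phantomjs.exe_PATH is a placeholder for the path to the phantomjs executable
html <- system("phantomjs.exe_PATH scrape.js", intern = TRUE)
# system(intern = TRUE) returns one element per output line,
# so collapse the lines into a single string before parsing
page_html <- read_html(paste(html, collapse = "\n"))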
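
One way to narrow down where the 404 comes from is to have PhantomJS report the HTTP status of every resource it receives, instead of only dumping the page source. The following is a minimal sketch, not a confirmed fix: the file name debug_scrape.js and the helpers formatResponse and isHttpError are hypothetical names introduced for illustration, while page.onResourceReceived, page.open, system.args and system.stderr are part of the PhantomJS API. It is written in ES5 style because PhantomJS does not support newer syntax.

```javascript
// debug_scrape.js (hypothetical file name)
// Usage sketch: phantomjs debug_scrape.js <url>

// Hypothetical helper: formats one received-resource record as "<status> <url>".
function formatResponse(response) {
    return response.status + ' ' + response.url;
}

// Hypothetical helper: true for numeric HTTP status codes of 400 and above.
// The status can be null for some resources, so the type is checked first.
function isHttpError(status) {
    return typeof status === 'number' && status >= 400;
}

// Only touch the PhantomJS API when the script actually runs under PhantomJS.
if (typeof phantom !== 'undefined') {
    var system = require('system');
    var page = require('webpage').create();
    var url = system.args[1];

    page.onResourceReceived = function (response) {
        // Each resource is reported at stage 'start' and 'end'; log once, at 'end'.
        if (response.stage === 'end' && isHttpError(response.status)) {
            system.stderr.writeLine(formatResponse(response));
        }
    };

    page.open(url, function (status) {
        // status is 'success' or 'fail'
        system.stderr.writeLine('page.open status: ' + status);
        console.log(page.content); // page source on stdout, as in the original script
        phantom.exit();
    });
}
```

The diagnostics go to stderr so that stdout still carries only the page source. If the logged 404 belongs to a secondary request rather than to the profile URL itself, that would point to a later request failing after the page shell loads; that is something to check, not an established diagnosis.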

PhantomJS returns 404 in R when attempting to web scrape

0 answers: