Question

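I am trying to use R to download some files from a web page, but the page's content is rendered by JavaScript, and this has been giving me trouble.

The files are listed in a table. My plan is to retrieve and read the page, scrape the table, identify the URLs, and download the files. This question is about the first step: retrieving and reading the page.

After some searching, I found a solution that uses PhantomJS, which looked fine to me. I am not proficient in JS, so although I can follow the code, I have little idea how to make it work in my case.

My current script is: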

// scrape_super_data_science_ml_data.js

var webPage = require('webpage');
var page = webPage.create();

var fs = require('fs');
var path = 'super_data_science_ml_data.html'

page.open('https://www.superdatascience.com/pages/machine-learning', function (status) {
  var content = page.content;
  fs.write(path,content,'w')
  phantom.exit();
});

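When I run it, the page is downloaded, but without the JS-rendered content. I don't know whether this is a timing issue, with the page being saved before the content has finished rendering, or something else.

Here is an example of my process in R: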

# Scrape page
system("phantomjs scrape_super_data_science_ml_data.js")

# Check results
library(rvest)
library(dplyr)
page <- read_html("super_data_science_ml_data.hmtl")
page %>%  html_text()

Can anyone help me? Any tips would be greatly appreciated!

Answer 1

I'm not sure whether this is the exact code you are using, but the code you posted has a couple of errors. For the PhantomJS code, I use:

// scrape_super_data_science_ml_data.js
var page = require('webpage').create();

// Load the page, then print its HTML to stdout so the caller can redirect it to a file.
page.open('https://www.superdatascience.com/pages/machine-learning', function()
{
    console.log(page.content);
    phantom.exit();
});

Then I call it from R with:

# Scrape page
system("phantomjs scrape_super_data_science_ml_data.js > super_data_science_ml_data.html")

# Check results
library(rvest)
library(dplyr)
page <- read_html("super_data_science_ml_data.html")
page %>%  html_text()

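The first error is that you did not have R save the HTML from the system() call (here that is done by redirecting the script's output to a file), and the second is the typo "super_data_science_ml_data.hmtl".

As for your rendering problem, one of the main reasons to use PhantomJS rather than rvest alone is that it renders JS: it is a headless browser, not a simpler scraper like rvest.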
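
If the saved HTML is still missing the rendered content, timing may be the cause: the callback passed to page.open can fire before the page's own scripts have finished inserting content. Below is a minimal sketch of one common workaround, waiting a fixed delay before dumping page.content. The file name wait_then_dump.js, the helper captureAfterDelay, and the 5000 ms delay are my own assumptions for illustration. I have not checked what delay this particular site needs.

```javascript
// wait_then_dump.js (hypothetical file name)
// Run with: phantomjs wait_then_dump.js > super_data_science_ml_data.html

// Hands page.content to done() after delayMs milliseconds.
// schedule is an optional stand-in for setTimeout, which lets the helper
// be exercised outside PhantomJS.
function captureAfterDelay(page, delayMs, done, schedule) {
  var timer = schedule || setTimeout;
  timer(function () {
    done(page.content);
  }, delayMs);
}

// Only start the scrape when the script is run by PhantomJS.
if (typeof phantom !== 'undefined') {
  var system = require('system');
  var page = require('webpage').create();

  page.open('https://www.superdatascience.com/pages/machine-learning', function (status) {
    if (status !== 'success') {
      // Report on stderr so the message does not end up in the redirected HTML.
      system.stderr.writeLine('Failed to load page: ' + status);
      phantom.exit(1);
      return;
    }
    // Assumed delay: give the page's scripts 5 seconds to render.
    captureAfterDelay(page, 5000, function (html) {
      console.log(html);
      phantom.exit();
    });
  });
}
```

It can be called from R in the same way as above, with system() and the output redirected to an .html file. A fixed delay is crude. If it proves too short or needlessly long, polling for the table element inside page.evaluate would be a more targeted alternative.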
