Question

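I'm trying to learn how to build a Facebook group scraper that pulls information from a group (the list of posts in the group, including who wrote each post, the post ID, the publication date, and so on).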

I should point out that I'm at the very beginning of my research into page scraping!

I found a good tutorial on this page: http://www.oooff.com/php-scripts/basic-curl-scraping-php/basic-scraping-with-curl.php

When I run this code:

<?php
    $url = "http://www.oooff.com/";

    // Initialize a cURL session and get a handle ($ch) to work with.
    $ch = curl_init($url);
    // Set an option on the handle: CURLOPT_RETURNTRANSFER makes curl_exec()
    // return the response as a string instead of printing it directly.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Run the request with the options set above and store the scraped page.
    $curl_scraped_page = curl_exec($ch);
    curl_close($ch);

    echo $curl_scraped_page;
?>

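It works, but sometimes when I run it I get a blank page. When I run it against Facebook (or more specifically an FB group, since that's what I need), I get a blank page. I tried running it against yahoo.com and got the same result.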

Why does this happen?
What is the right way to fetch a web page's content?

Answer 1

If you're mainly interested in Facebook content, you can use the Facebook API for PHP: https://developers.facebook.com/docs/reference/php/

cURL only downloads the content of the file; it does not run the page's JavaScript.

According to Vivin Paliath's answer, PhantomJS may be a good way to get content from JavaScript-driven pages:

[...] PhantomJS is a headless WebKit browser. It has its own API that lets you "script" its behavior. So you can tell PhantomJS to load a page and dump out the data you need.
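
Before reaching for a headless browser, it may be worth checking what cURL actually got back, because a blank page from the script in the question is not always a JavaScript problem. `curl_exec()` returns `false` on failure (and `echo false` prints nothing), and cURL does not follow redirects unless told to, so a redirect response with an empty body also shows up as a blank page. The following is only a rough sketch: the helper names `fetch_page` and `describe_result`, the user-agent string, and the timeout values are my own assumptions, not something from the tutorial.

```php
<?php
// Hypothetical helper: fetch a URL and return the body plus some diagnostics.
function fetch_page(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects (off by default)
        CURLOPT_MAXREDIRS      => 5,    // assumed limit
        CURLOPT_TIMEOUT        => 15,   // assumed timeout in seconds
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleScraper/0.1)', // assumed value
    ]);

    $body   = curl_exec($ch);                              // false on transport failure
    $error  = curl_error($ch);                             // '' when nothing went wrong
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 if no response arrived
    curl_close($ch);

    return ['body' => $body, 'status' => $status, 'error' => $error];
}

// Hypothetical helper: turn the result into a one-line explanation.
function describe_result(array $r): string
{
    if ($r['body'] === false) {
        return 'cURL failed: ' . $r['error'];
    }
    if ($r['status'] >= 300 && $r['status'] < 400) {
        return 'Unfollowed redirect (HTTP ' . $r['status'] . ')';
    }
    if ($r['status'] >= 400) {
        return 'HTTP error ' . $r['status'];
    }
    if (trim($r['body']) === '') {
        return 'Empty body (HTTP ' . $r['status'] . ')';
    }
    return 'OK: ' . strlen($r['body']) . ' bytes';
}
```

Calling `echo describe_result(fetch_page($url));` with the `$url` from the question should tell you whether the request failed, was redirected, or really returned an empty document. If it reports a normal-sized body but the data you want is still missing, the content is most likely built by JavaScript, and that is the case where the API or PhantomJS suggestions above apply.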
