Question

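I'm trying to write a web crawler, but I don't know how to write a recursion that parses a web page and adds all the end results to one final array. I have never used PHP before, but I have done a lot of research online and have already figured out how to parse the page I want to scrape.
Please note that I changed the $url value and the array results below to random values I made up.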

<?php
include_once "simple_html_dom.php"; //http://simplehtmldom.sourceforge.net/

$url = "https://www.scrapesite.com/pagetoscrape/index.html";

function parseLink($link) {
    $html = file_get_html($link);
    $html = $html->find("/html/body/script[2]/text", 0);
    preg_match('/\{(?:[^{}]|(?R))*\}/', $html, $matches); //this regex extracts a json array
    $json = json_decode($matches[0]);
    $data = ($json->props->contents);
    return $data;
}
function getFolders($basepath, $data) {
    $data = $data->folders;
    $result = array();

    foreach ($data as $value) {
        $result[] = array("folder", $basepath . "/" . $value->filename, $value->href);
    }

    return $result;
}

$data = getFolders("", parseLink($url));
print_r ($data);

?>

This script works fine and outputs the following:

Array
(
    [0] => Array
        (
            [0] => folder
            [1] => /1
            [2] => https://www.scrapesite.com/pagetoscrape/sjdfi327943sad/index.html
        )

    [1] => Array
        (
            [0] => folder
            [1] => /2
            [2] => https://www.scrapesite.com/pagetoscrape/345fdsjjsdfsdf/index.html
        )

    [2] => Array
        (
            [0] => folder
            [1] => /3
            [2] => https://www.scrapesite.com/pagetoscrape/46589dsjodsiods/index.html
        )

    [3] => Array
        (
            [0] => folder
            [1] => /4
            [2] => https://www.scrapesite.com/pagetoscrape/345897dujfosfsd/index.html
        )

    [4] => Array
        (
            [0] => folder
            [1] => /5
            [2] => https://www.scrapesite.com/pagetoscrape/9dsfghshdfsds3/index.html
        )

)

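Now the script should run the getFolders function on each item in the array above. That may return another array of folders, which should be parsed as well. In the end I want one final array that lists every folder's ABSOLUTE path ($basepath . "/" . $value->filename) and its href link.

I would really appreciate any hint. I was able to find some examples online, but I couldn't figure out how to apply them here, since I have almost no experience with programming languages.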

Answer 1

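Initialize an empty array and pass it by reference to the getFolders() function. Keep putting the scraped results into this array. You also need to call getFolders() again inside the foreach loop of getFolders(). An example follows:

$finalResults = array(); getFolders("", parseLink($url), $finalResults);

Your getFolders() function signature now looks like this:

function getFolders($basepath, $data, &$finalResults) //notice the & before the $finalResults used for passing by reference

And in your foreach loop, append each folder to $finalResults, then call getFolders() again with that folder's path and parseLink($value->href), passing $finalResults along.

The above code is just an example. Change it according to your needs.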
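To make the recursion concrete, here is a minimal sketch of how the pieces above might fit together. It is not the asker's real script: parseLink() is replaced by a hypothetical stub that reads from a small made-up in-memory tree (the keys "root", "a", "b", "c" are invented for illustration), so the example runs without simple_html_dom or network access. Only the shape of getFolders() is meant to carry over.

```php
<?php
// Hypothetical stand-in for the real parseLink(): instead of fetching a page,
// it looks the "URL" up in a made-up tree and returns an object with a
// ->folders list, mimicking the structure used in the question.
function parseLink($link) {
    $fakeSite = array(
        "root" => array(array("1", "a"), array("2", "b")),
        "a"    => array(array("x", "c")),
        "b"    => array(),
        "c"    => array(),
    );
    $data = new stdClass();
    $data->folders = array();
    foreach ($fakeSite[$link] as $entry) {
        $folder = new stdClass();
        $folder->filename = $entry[0];
        $folder->href = $entry[1];
        $data->folders[] = $folder;
    }
    return $data;
}

// $finalResults is passed by reference, so every level of the recursion
// appends to the same array.
function getFolders($basepath, $data, &$finalResults) {
    foreach ($data->folders as $value) {
        $path = $basepath . "/" . $value->filename;
        $finalResults[] = array("folder", $path, $value->href);
        // Recurse into the subfolder, using its absolute path as the new base.
        getFolders($path, parseLink($value->href), $finalResults);
    }
}

$finalResults = array();
getFolders("", parseLink("root"), $finalResults);
print_r($finalResults);
```

With the made-up tree above, $finalResults ends up holding three entries with the paths /1, /1/x and /2. In the real script you would keep your own parseLink() and only change getFolders() and the call site.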

PHP recursion, collecting the results in a single array
