Unable to scrape content that spans multiple pages

Time: 2018-09-18 21:01:39

Tags: php curl web-scraping simple-html-dom

I've written a script in PHP to scrape the links and their titles from a webpage. The webpage spreads its content across multiple pages. My script below can parse the links and titles from its landing page.

How can I fix my existing script so that it gets data from multiple pages, for example up to 10 pages?

This is my attempt so far:

<?php
include "simple_html_dom.php";
$link = "https://stackoverflow.com/questions/tagged/web-scraping?page=2";

function get_content($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);
    $dom = new simple_html_dom();
    $dom->load($htmlContent);
    foreach ($dom->find('.question-summary') as $file) {
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        echo "{$itemTitle},{$itemLink}<br>";
    }
}
get_content($link);
?>

The site numbers its pages incrementally in the query string, for example ?page=2.

2 answers:

Answer 0: (score: 0)

Here is how I would do it using XPath:

$url = 'https://stackoverflow.com/questions/tagged/web-scraping';

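// loadUrlSource() is not defined in this answer; it is assumed to return the page HTML as a string.
$source = loadUrlSource($url);

// Collect the warnings libxml raises on real-world (non-strict) HTML instead of printing them.
libxml_use_internal_errors(true);
$domPage = new DOMDocument();
$domPage->loadHTML($source);
$xpath_page = new DOMXPath($domPage);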

// Find page links with the title "go to page" within the div container that contains "pager" class.
$pageItems = $xpath_page->query("//div[contains(@class, 'pager')]//a[contains(@title, 'go to page')]");

// Get the last page number.
// The "next" link also has a title containing "go to page", so the last numbered link
// is the second-to-last match (length - 2).
$pageCount = (int)$pageItems->item($pageItems->length - 2)->textContent;

// Loop over every page, including the last one
for ($page = 1; $page <= $pageCount; $page++) {

    $source = loadUrlSource($url . "?page={$page}");

    // Do whatever with the source. You can also call simple_html_dom on the content.
    // $dom = new simple_html_dom();
    // $dom->load($source);

}
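
The answer above calls loadUrlSource() without defining it, so here is a minimal sketch of what such a helper might look like, together with a small function that reads the last page number from the pager. Both function bodies are assumptions, not the answerer's code: loadUrlSource() is a hypothetical cURL wrapper, the user agent string is a placeholder, and lastPageNumber() assumes the pager markup described in the answer (links titled "go to page" inside an element whose class contains "pager"). Instead of indexing the second-to-last link, this version takes the largest numeric link text, which avoids depending on where the "next" link sits.

```php
<?php
// Hypothetical helper: fetch a URL with cURL and return the response body.
function loadUrlSource(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 30,   // give up after 30 seconds
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; example-scraper)', // placeholder
    ]);
    $body  = curl_exec($ch);
    $error = curl_error($ch);
    if ($body === false) {
        throw new RuntimeException("Request to {$url} failed: {$error}");
    }
    return $body;
}

// Return the highest page number found in the pager, or 1 if there is no pager.
// Assumed markup: <div class="pager"> <a title="go to page N">N</a> ... </div>
function lastPageNumber(string $html): int
{
    if (trim($html) === '') {
        return 1; // nothing to parse
    }

    $previous = libxml_use_internal_errors(true); // collect HTML warnings silently
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = $xpath->query(
        "//div[contains(@class, 'pager')]//a[contains(@title, 'go to page')]"
    );

    $last = 1;
    foreach ($links as $link) {
        $text = trim($link->textContent);
        if (ctype_digit($text)) {       // skips non-numeric links such as "next"
            $last = max($last, (int) $text);
        }
    }
    return $last;
}
```

With these in place, the loop could run from 1 to lastPageNumber(loadUrlSource($url)), or to min(10, ...) if only the first 10 pages are wanted.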

Answer 1: (score: 0)

This is how I got it working (following Nima's suggestion).

<?php
include "simple_html_dom.php";
$link = "https://stackoverflow.com/questions/tagged/web-scraping?page="; 

function get_content($url)
{
    // Fetch the page; CURLOPT_RETURNTRANSFER makes curl_exec() return the body as a string.
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $htmlContent = curl_exec($ch);
    curl_close($ch);
    if ($htmlContent === false) {
        return; // request failed; skip this page
    }

    // Parse the HTML and print "title,link" for every question on the page.
    $dom = new simple_html_dom();
    $dom->load($htmlContent);
    foreach ($dom->find('.question-summary') as $file) {
        $itemTitle = $file->find('.question-hyperlink', 0)->innertext;
        $itemLink = $file->find('.question-hyperlink', 0)->href;
        echo "{$itemTitle},{$itemLink}<br>";
    }
}
// Append the page number to the base URL and fetch pages 1 through 10.
for ($i = 1; $i <= 10; $i++) {
    get_content($link . $i);
}
?>
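
One possible refinement of the script above is to separate parsing from fetching, so that the extraction step returns data rather than echoing it and can be checked against a saved HTML string. The sketch below is an assumption-laden variant, not the answerer's code: it swaps simple_html_dom for PHP's built-in DOMDocument and DOMXPath, and the names extractQuestions() and classTest() are hypothetical. It assumes the same markup as the original script (elements with class "question-summary" containing a link with class "question-hyperlink").

```php
<?php
// Build an XPath predicate that matches one whole class name in @class.
function classTest(string $name): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' {$name} ')";
}

// Return a list of ['title' => ..., 'link' => ...] rows found in the given HTML.
function extractQuestions(string $html): array
{
    if (trim($html) === '') {
        return []; // nothing to parse
    }

    $previous = libxml_use_internal_errors(true); // collect HTML warnings silently
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $rows  = [];
    foreach ($xpath->query('//*[' . classTest('question-summary') . ']') as $summary) {
        // First question link inside this summary, if any.
        $link = $xpath->query('.//a[' . classTest('question-hyperlink') . ']', $summary)->item(0);
        if ($link === null) {
            continue; // summary without a link; skip it
        }
        $rows[] = [
            'title' => trim($link->textContent),
            'link'  => $link->getAttribute('href'),
        ];
    }
    return $rows;
}
```

Inside the page loop, the HTML returned by curl_exec() would be passed to extractQuestions(), and each row could then be printed, written to CSV, or stored as needed.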