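Get pagination links

Date: 2016-07-20 05:25:48

Tags: php regex http curl preg-match-all

I am new to PHP. What I am trying to do is get the pagination links. The page has pagination, and when a page is selected the course links change as well. How can I get the pagination URLs while staying on the main page, http://ahadith.co.uk/sahihmuslim.php?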

<?php
        // Fetch the main page
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, "http://ahadith.co.uk/sahihmuslim.php");
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $output = curl_exec($ch);
        if ($output === false) {
            exit("Could not fetch the main page\n");
        }

        // This regex pulls the links that carry a cid parameter out of the page above
        $pattern = "/href='([a-zA-Z]+\.[a-zA-Z]+\?cid=[0-9]+)'/";
        preg_match_all($pattern, $output, $matches, PREG_PATTERN_ORDER);

        foreach ($matches[1] as $data) {
            // The content behind each link found above is stored in $homepage
            $homepage = file_get_contents('http://ahadith.co.uk/' . $data);
            if ($homepage === false) {
                continue;
            }

            // Fetch the chapter titles (the text inside <h2>...</h2>) from $homepage
            $pattern_chapter = '~<h2>\s*(.*?)\s*</h2>~s';
            preg_match_all($pattern_chapter, $homepage, $matches_chapter, PREG_PATTERN_ORDER);
            foreach ($matches_chapter[1] as $chapters) {
                echo $chapters, "\n";
            }
        }
?>

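Now I need to get the pagination links from the data stored in $homepage. In this case the pagination has 44 pages, and I want the links to all 44 pages. This is the regex that matches the pagination links: http:\/\/([a-zA-Z]+.[a-zA-Z]+.[a-zA-Z]+.[a-zA-Z]+.[a-zA-Z]+.[cid]+=[0-9]&[a-zA-Z]+=[0-9]&[a-zA-Z]+=[0-9]+). I have searched in many places but could not find anything relevant. Could anyone please help me?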
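For illustration, here is a minimal sketch of how such a regex could be applied to $homepage with preg_match_all. The sample markup and the parameter names (page, rows) are assumptions made up for this example; the real page may use different ones. The pattern escapes the literal `?`, allows more than one digit per value, and accepts `&amp;` as well as `&`, since links inside HTML are often entity-encoded.

```php
<?php
// Hypothetical sample of what $homepage might contain; the real markup may differ.
$homepage = <<<HTML
<h2>Chapter 1</h2>
<a href="http://ahadith.co.uk/chapter.php?cid=1&amp;page=1&amp;rows=10">1</a>
<a href="http://ahadith.co.uk/chapter.php?cid=1&amp;page=2&amp;rows=10">2</a>
<a href="http://ahadith.co.uk/chapter.php?cid=1&amp;page=2&amp;rows=10">Next</a>
<a href="http://ahadith.co.uk/about.php">About</a>
HTML;

// Match href values (single or double quoted) whose query string starts with cid
// and is followed by two more numeric parameters.
$pattern = '~href=["\'](https?://[^"\']+\?cid=\d+&(?:amp;)?\w+=\d+&(?:amp;)?\w+=\d+)["\']~i';
preg_match_all($pattern, $homepage, $m);

// Decode &amp; back to & and drop duplicates such as the "Next" link.
$pageLinks = array_values(array_unique(array_map('html_entity_decode', $m[1])));

print_r($pageLinks);
```

If the page only shows a window of page numbers rather than all 44, the captured links would still reveal the URL scheme, and the remaining URLs could be generated in a loop from it.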

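1 answer:

Answer 0 (score: 0)

Use "HtmlPageDom". It is a third-party library that makes it easy to manipulate HTML documents through the DOM. You can extract any kind of data from any page.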

https://github.com/wasinger/htmlpagedom/blob/master/README.md
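To give a rough idea of what a DOM-based approach looks like, here is a sketch that uses PHP's built-in DOMDocument and DOMXPath instead of HtmlPageDom, so it does not guess at that library's API. The function name extractPaginationLinks, the sample markup, and the assumption that pagination links carry a `page` query parameter are all hypothetical.

```php
<?php
// Hypothetical helper: collect the unique links whose query string has a "page" parameter.
function extractPaginationLinks(string $html, string $baseUrl): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        $href = $a->getAttribute('href');
        // Split the query string into an array of parameters.
        parse_str((string) parse_url($href, PHP_URL_QUERY), $query);
        if (!isset($query['page'])) {
            continue; // not a pagination link under our assumption
        }
        // Turn relative links into absolute ones.
        if (!preg_match('~^https?://~i', $href)) {
            $href = rtrim($baseUrl, '/') . '/' . ltrim($href, '/');
        }
        $links[$href] = true; // array keys de-duplicate the links
    }
    return array_keys($links);
}

// Made-up markup standing in for the fetched page.
$sample = '<a href="chapter.php?cid=1&amp;page=1">1</a>
<a href="chapter.php?cid=1&amp;page=2">2</a>
<a href="chapter.php?cid=1&amp;page=2">Next</a>
<a href="about.php">About</a>';

$links = extractPaginationLinks($sample, 'http://ahadith.co.uk');
print_r($links);
```

Compared with a regex, the parser takes care of quoting styles, attribute order, and entity decoding, so the filter only has to state which links count as pagination.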