Question

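I have to use a scraper for my project.

I am using the Simple HTML DOM class to get all the links on a page.

Now I want to keep only the links of the form "/questions/3904482/<title of the question>".

Here is my attempt: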

include_once('simple_html_dom.php');
$html = new simple_html_dom();
$html->load_file('http://stackoverflow.com/questions?sort=newest');
$pat='#^/question/([0-9]+)/#';
foreach($html->find('a') as $link)
{
    echo preg_match($pat, $link->href);
    {
        echo $link->href."<br>";
    }
}

All the links get filtered out.

Answer 1

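You say the URLs start with /questions/ (with an s), but your pattern has /question/ without the s.

Also, it looks like you should be using if instead of echo. echo only prints the return value of preg_match (0 or 1); it does not make the block that follows conditional.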

include_once('simple_html_dom.php');
$html = new simple_html_dom();
$html->load_file('http://stackoverflow.com/questions?sort=newest');
$pat='#^/questions/([0-9]+)/#';
foreach($html->find('a') as $link)
{

    if ( preg_match($pat, $link->href) )
    {
        echo $link->href."<br>";
    }
}
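To check the corrected pattern without fetching a live page, here is a minimal sketch that runs it against a few hard-coded hrefs. The sample URLs and the filterQuestionLinks helper are made up for illustration and are not part of simple_html_dom.

```php
<?php
// Hypothetical helper: keeps only hrefs that look like /questions/<digits>/...
function filterQuestionLinks(array $hrefs): array
{
    $pat = '#^/questions/([0-9]+)/#';
    $kept = [];
    foreach ($hrefs as $href) {
        // preg_match returns 1 on a match, 0 on no match, false on error
        if (preg_match($pat, $href) === 1) {
            $kept[] = $href;
        }
    }
    return $kept;
}

// Made-up sample hrefs standing in for $link->href values
$sample = [
    '/questions/3904482/match-url-to-pattern',
    '/questions/tagged/php',
    '/questions/ask',
    '/users/12345/someone',
];

print_r(filterQuestionLinks($sample));
```

Only the first href should survive, because the pattern requires digits right after /questions/. With the original /question/ pattern, none of them would match.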

Answer 2

You can make use of DOM and XPath:

<?php

$dom = new DOMDocument;
@$dom->loadHTMLFile('http://stackoverflow.com/questions?sort=newest');
$xpath = new DOMXPath($dom);
$questions = $xpath->query("//a[contains(@href, '/questions/') and not(contains(@href, '/tagged/')) and not(contains(@href, '/ask'))]");

foreach ($questions as $question) {
    print "{$question->getAttribute('href')} => {$question->nodeValue}";
}
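The contains() checks above exclude specific paths one by one. As a possible variation, the sketch below narrows candidates with the XPath starts-with() function and then applies the numeric-id regex from the question. The HTML fragment is made up so the example runs without network access; a real page may be structured differently.

```php
<?php
// Made-up HTML fragment standing in for the fetched page
$html = '<html><body>
<a href="/questions/3904482/match-url-to-pattern">Match URL to pattern</a>
<a href="/questions/tagged/php">php</a>
<a href="/questions/ask">Ask Question</a>
</body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$found = [];
// starts-with() narrows the candidates; the regex then requires a numeric id
foreach ($xpath->query("//a[starts-with(@href, '/questions/')]") as $a) {
    $href = $a->getAttribute('href');
    if (preg_match('#^/questions/([0-9]+)/#', $href) === 1) {
        $found[$href] = trim($a->nodeValue);
    }
}

print_r($found);
```

Here only the link with a numeric id is kept, mapped to its link text, so /tagged/ and /ask do not need to be excluded by name.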

Match a URL against a pattern using PHP
