PHP regex - empty matches

Time: 2014-06-09 20:50:43

Tags: php regex preg-match-all

I am trying to extract data from a string (the entire source of a website, fetched with cURL):

<tr>
    <td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td>
</tr>
<tr class="even">
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA SPÓŁKA AKCYJNA</a></td>

I want to collect all the 3-character anchor texts in an array, e.g. AAL and AAT (there are many more).

What I have is:

$subject = curl_exec($ch);        
$pattern = '`<td><a href="http://www\.gpw\.pl/karta_spolki/[0-9A-Za-z ]+/">[0-9A-Z]{3}</a></td>`';
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);
print_r($matches);

As a result I get:

Array ( [0] => Array ( ) ) 

Can you give me any advice on how to fix this?

2 answers:

Answer 0: (score: 1)

You can use a DOMDocument object to build your array:

$doc = new DOMDocument();
$doc->loadHTML($str); // $str holds the HTML source, e.g. the string returned by curl_exec()

$matches = array();
foreach($doc->getElementsByTagName('a') as $a) {
    $text = $a->nodeValue;
    if(strlen($text) === 3) $matches[] = $text;
}

This iterates over all anchor elements in the HTML string and builds this array:

Array
(
    [0] => AAL
    [1] => AAT
)
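The loop above keeps every link whose text is exactly three bytes long, wherever it appears on the page. A minimal sketch of a narrower variant follows, assuming the tickers always sit in `<td>` cells and link to a `/karta_spolki/` URL as in the question's sample. The helper name `extractTickers`, the wrapping `<table>` element, and the XPath expression are illustrative assumptions, not part of the original answer.

```php
<?php
// Sample markup taken from the question; a real page would come from curl_exec().
$html = <<<EOT
<table>
<tr>
    <td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
    <td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td>
</tr>
<tr class="even">
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA SPÓŁKA AKCYJNA</a></td>
</tr>
</table>
EOT;

// Hypothetical helper: collects 3-character link texts from company-card links.
function extractTickers(string $html): array
{
    // Real-world HTML is rarely valid; buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $tickers = array();
    // Only anchors inside table cells whose href points at a company card.
    foreach ($xpath->query('//td/a[contains(@href, "/karta_spolki/")]') as $a) {
        $text = trim($a->nodeValue);
        // Keep texts made of exactly three digits or uppercase ASCII letters.
        if (preg_match('/^[0-9A-Z]{3}$/', $text)) {
            $tickers[] = $text;
        }
    }
    return $tickers;
}

print_r(extractTickers($html));
```

Checking the text against `[0-9A-Z]{3}` rather than `strlen()` mirrors the character class used in the question's pattern, so a three-letter lowercase link such as a navigation label would be skipped.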

Answer 1: (score: 1)

I just tried your example, and your regex works as expected on the small sample provided:

$subject = <<<EOT
<tr>
    <td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td>
</tr>
<tr class="even">
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA SPÓŁKA AKCYJNA</a></td>
EOT;

$pattern = '`<td><a href="http://www\.gpw\.pl/karta_spolki/[0-9A-Za-z ]+/">[0-9A-Z]{3}</a></td>`';
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);

echo '<pre>';
print_r($matches);
echo '</pre>';

Result (as rendered in the browser):

Array
(
    [0] => Array
        (
            [0] => AAL
            [1] => AAT
        )

)
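The pattern contains no capture group, so `$matches[0]` holds each full match, including the surrounding `<td><a …>` markup; a browser renders that markup as links, which is why only the link text is visible in the output. A minimal sketch of one way to get just the three characters follows. The parentheses around `[0-9A-Z]{3}` and the `htmlspecialchars()` call are my additions, not part of the original pattern or answer.

```php
<?php
$subject = <<<EOT
<tr>
    <td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td>
</tr>
<tr class="even">
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>
    <td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA SPÓŁKA AKCYJNA</a></td>
EOT;

// Same pattern as above, with a capture group added around the 3-character part.
$pattern = '`<td><a href="http://www\.gpw\.pl/karta_spolki/[0-9A-Za-z ]+/">([0-9A-Z]{3})</a></td>`';
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);

// $matches[0] holds the full matches (markup included);
// $matches[1] holds only the captured link text.
// htmlspecialchars() keeps a browser from rendering the markup as links.
echo '<pre>';
echo htmlspecialchars(print_r($matches, true));
echo '</pre>';
```

With `PREG_PATTERN_ORDER`, the tickers can then be read directly from `$matches[1]`.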

That said, I dug up what I believe is your source URL for the curl request, and when I tested your regex against it, it failed. So I adjusted the regex to:

/(?<=>)[0-9A-Z]{3}(?=<\/a><\/td>)/is

Now things seem to work well with my code, which attempts to recreate the curl request you are making:

// Set the URL.
$url="http://www.gpw.pl/lista_spolek_en";

// The actual curl request.
$curl_timeout = 5;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $curl_timeout);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$subject = curl_exec($ch);
curl_close($ch);

// Set the regex pattern.
$pattern = '/(?<=>)[0-9A-Z]{3}(?=<\/a><\/td>)/is';

// Run the preg match all command with the regex pattern.
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);

// Return the results.
echo '<pre>';
print_r($matches);
echo '</pre>';

From my end, the output looks good:

Array
(
    [0] => Array
        (
            [0] => AAL
            [1] => AAT
            [2] => ABC
            [3] => ABE
            [4] => ABM
            [5] => ABS
            [6] => ACE
            [7] => ACG
            [8] => ACP
            [9] => ACS
            [10] => ACT
            [11] => ADS
            [12] => AGO
            [13] => AGT
            [14] => ALC
            [15] => ALM
            [16] => ALR
            [17] => ALT
            [18] => AMB
            [19] => AMC
            [20] => APL
            [21] => APN
            [22] => APT
            [23] => ARC
            [24] => ARR
            [25] => ASB
            [26] => ASE
            [27] => ASG
            [28] => AST
            [29] => ATC
            [30] => ATD
            [31] => ATG
            [32] => ATL
            [33] => ATM
            [34] => ATP
            [35] => ATR
            [36] => ATS
            [37] => AWB
            [38] => AWG
            [39] => EAT
            [40] => ACP
            [41] => ALR
            [42] => BZW
            [43] => EUR
            [44] => JSW
            [45] => KER
            [46] => KGH
            [47] => LPP
            [48] => LTS
            [49] => LWB
            [50] => MBK
            [51] => OPL
            [52] => PEO
            [53] => PGE
            [54] => PGN
            [55] => PKN
            [56] => PKO
            [57] => PZU
            [58] => SNS
            [59] => TPE
        )

)
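Two details of this result are worth noting: some entries appear twice (ACP and ALR, for example), and the `i` flag lets `[0-9A-Z]` match lowercase letters as well. A minimal sketch of a stricter variant follows, run on a made-up inline sample rather than the live page. The sample markup, the choice to drop the `i` and `s` flags, and the de-duplication step are assumptions for illustration, not part of the original answer.

```php
<?php
// Hypothetical snippet standing in for the fetched page; AAL is repeated on purpose.
$subject = '<td><a href="/karta_spolki/LT0000128555/">AAL</a></td>'
    . '<td><a href="/karta_spolki/PLTRNSU00013/">AAT</a></td>'
    . '<td><a href="/karta_spolki/LT0000128555/">AAL</a></td>'
    . '<td><a href="/x/">abc</a></td>';

// The lookbehind (?<=>) requires a ">" right before the match and the lookahead
// requires "</a></td>" right after it; neither is included in the match itself.
// The "i" flag is left out so that lowercase text such as "abc" is skipped.
$pattern = '/(?<=>)[0-9A-Z]{3}(?=<\/a><\/td>)/';
preg_match_all($pattern, $subject, $matches);

// array_unique() keeps the first occurrence of each value;
// array_values() reindexes the result from 0.
$tickers = array_values(array_unique($matches[0]));
print_r($tickers);
```

Because the lookarounds consume nothing, `$matches[0]` already contains only the three characters, so no capture group is needed here.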