I am trying to extract data from a string (the entire source of a web page, fetched with cURL):
<tr>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td>
</tr>
<tr class="even">
<td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA SPÓŁKA AKCYJNA</a></td>
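I want to collect into an array every anchor whose text is exactly 3 characters long, such as AAL and AAT (there are many more).
What I have so far: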
$subject = curl_exec($ch);
$pattern = '`<td><a href="http://www\.gpw\.pl/karta_spolki/[0-9A-Za-z ]+/">[0-9A-Z]{3}</a></td>`';
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);
print_r($matches);
As a result I get:
Array ( [0] => Array ( ) )
Can you give me any advice on how to fix this?
Answer 0: (score: 1)
You can use a DOMDocument object to build your array:
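$doc = new DOMDocument();
$doc->loadHTML($str); // $str holds the HTML source, e.g. the curl_exec() result
$matches = array();
// Walk every <a> element and keep those whose text is exactly 3 characters long.
foreach ($doc->getElementsByTagName('a') as $a) {
    $text = $a->nodeValue;
    if (strlen($text) === 3) {
        $matches[] = $text;
    }
}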
This iterates over all anchor elements in the HTML string and builds this array:
Array
(
[0] => AAL
[1] => AAT
)
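The length check above accepts any anchor with three-character text, wherever it appears on the page. A slightly stricter variant is sketched below, assuming the tickers always sit in `<td>` cells and link to a `/karta_spolki/` URL as in the question's sample. The `$html` sample, the XPath expression, and the `$tickers` name are illustrative assumptions, not part of the original answer.

```php
<?php
// Hypothetical sample input; in practice this would be the curl_exec() result.
$html = <<<'HTML'
<table>
<tr>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td>
</tr>
<tr class="even">
<td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA SPÓŁKA AKCYJNA</a></td>
</tr>
</table>
HTML;

// Real-world HTML is often not well formed; collect parser warnings
// internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select only anchors inside table cells that point at a company page.
$xpath = new DOMXPath($doc);
$tickers = array();
foreach ($xpath->query('//td/a[contains(@href, "/karta_spolki/")]') as $a) {
    $text = trim($a->nodeValue);
    // Keep only texts made of exactly three digits or uppercase letters.
    if (preg_match('/^[0-9A-Z]{3}$/', $text)) {
        $tickers[] = $text;
    }
}

// Drop repeated tickers and reindex the array from 0.
$tickers = array_values(array_unique($tickers));
print_r($tickers);
```

If the live page uses different markup than the sample, the XPath expression would need adjusting accordingly.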
Answer 1: (score: 1)
I just tried your example, and your regex works as expected with the small sample provided:
$subject = <<<EOT
<tr>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td>
</tr>
<tr class="even">
<td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA SPÓŁKA AKCYJNA</a></td>
EOT;
$pattern = '`<td><a href="http://www\.gpw\.pl/karta_spolki/[0-9A-Za-z ]+/">[0-9A-Z]{3}</a></td>`';
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);
echo '<pre>';
print_r($matches);
echo '</pre>';
Result:
Array
(
[0] => Array
(
[0] => AAL
[1] => AAT
)
)
That said, I dug up what I believe is your source URL for the curl request, and the pattern failed when I tested it against that page. So I adjusted the regex to:
/(?<=>)[0-9A-Z]{3}(?=<\/a><\/td>)/is
Now it seems to work well with my code below, which tries to recreate the curl request you are making.
// Set the URL.
$url="http://www.gpw.pl/lista_spolek_en";
// The actual curl request.
$curl_timeout = 5;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $curl_timeout);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$subject = curl_exec($ch);
curl_close($ch);
// Set the regex pattern.
$pattern = '/(?<=>)[0-9A-Z]{3}(?=<\/a><\/td>)/is';
// Run the preg match all command with the regex pattern.
preg_match_all($pattern, $subject, $matches, PREG_PATTERN_ORDER);
// Return the results.
echo '<pre>';
print_r($matches);
echo '</pre>';
From my end, the output looks good:
Array
(
[0] => Array
(
[0] => AAL
[1] => AAT
[2] => ABC
[3] => ABE
[4] => ABM
[5] => ABS
[6] => ACE
[7] => ACG
[8] => ACP
[9] => ACS
[10] => ACT
[11] => ADS
[12] => AGO
[13] => AGT
[14] => ALC
[15] => ALM
[16] => ALR
[17] => ALT
[18] => AMB
[19] => AMC
[20] => APL
[21] => APN
[22] => APT
[23] => ARC
[24] => ARR
[25] => ASB
[26] => ASE
[27] => ASG
[28] => AST
[29] => ATC
[30] => ATD
[31] => ATG
[32] => ATL
[33] => ATM
[34] => ATP
[35] => ATR
[36] => ATS
[37] => AWB
[38] => AWG
[39] => EAT
[40] => ACP
[41] => ALR
[42] => BZW
[43] => EUR
[44] => JSW
[45] => KER
[46] => KGH
[47] => LPP
[48] => LTS
[49] => LWB
[50] => MBK
[51] => OPL
[52] => PEO
[53] => PGE
[54] => PGN
[55] => PKN
[56] => PKO
[57] => PZU
[58] => SNS
[59] => TPE
)
)
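Two details of the listing above are worth noting: some tickers appear twice (ACP at indices 8 and 40, ALR at 16 and 41), and the `i` flag lets `[0-9A-Z]` match lowercase letters as well. The sketch below is one possible variation rather than the answer's own method: it uses a capture group instead of lookarounds, drops the `i` flag, and removes duplicates. The sample `$subject`, the looser anchor pattern, and the `$tickers` name are assumptions for illustration; the live page's markup may differ.

```php
<?php
// Hypothetical sample; in practice $subject would be the curl_exec() result.
$subject = <<<'HTML'
<tr>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AVIAAM LEASING AB</a></td>
</tr>
<tr class="even">
<td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">AAT</a></td>
<td><a href="http://www.gpw.pl/karta_spolki/PLTRNSU00013/">ALTA SPÓŁKA AKCYJNA</a></td>
</tr>
<tr>
<td><a href="http://www.gpw.pl/karta_spolki/LT0000128555/">AAL</a></td>
</tr>
HTML;

// Match any <a ...> whose whole text is three digits/uppercase letters and
// that is immediately closed by </a></td>. Group 1 holds just the ticker.
$pattern = '~<a\b[^>]*>\s*([0-9A-Z]{3})\s*</a>\s*</td>~';
preg_match_all($pattern, $subject, $matches);

// $matches[1] contains the captured tickers; remove repeats and reindex.
$tickers = array_values(array_unique($matches[1]));
print_r($tickers);
```

Reading `$matches[1]` rather than `$matches[0]` returns only the ticker text, without the surrounding markup.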