Question

我正在使用file_get_contents()获取HTML网页，我得到如下表格，有超过150行：

<tr class="tabrow ">
    <td class="tabcol  tdmin_2l">FIRST+DATA</td>
    <td class="tabcol">
        <a class="modal-button" title="SECOND+DATA"  href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">
            asdxxx
        </a>
    </td>
    <td class="tabcol"></td>
    <td class="tabcol">FOURTH+DATA</td>
</tr>

我想通过FIRST DATA电话获取SECOND DATA，THIRD DATA，FOURTH DATA和preg_match_all()。我试着写多种模式，但我不能成功。这是我试过的：

preg_match_all('/(<td class="tabcol  tdmin_2l">|title=")(.*?)(<\/td>|")/s', $raw, $matches, PREG_SET_ORDER);

真正的模式是什么？

Answer 1

它没有直接回答你的问题，但这是正确的方法。

您应该避免使用正则表达式解析HTML / XML内容。想知道为什么？

使用正则表达式无法进行整个HTML解析，因为它依赖于匹配开头和结束标记，这是正则表达式无法实现的。

正则表达式只能匹配常规语言，但HTML是无上下文的语言。你可以用HTML上的regexp做的唯一的事情就是启发式，但这并不适用于所有条件。应该可以呈现一个HTML文件，该文件将被任何正则表达式错误地匹配。

- https://stackoverflow.com/a/590789/65732

请改用DOM parser。以下是对它的看法：

composer require symfony/dom-crawler symfony/css-selector

<?php

require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = <<<HTML
<tr class="tabrow ">
<td class="tabcol  tdmin_2l">FIRST+DATA</td>
<td class="tabcol"><a class="modal-button" title="SECOND+DATA"  href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">asdxxx</a></td>
<td class="tabcol"></td>
<td class="tabcol">FOURTH+DATA</td>
</tr>
HTML;

$crawler = new Crawler($html);

$first  = $crawler->filter('.tabcol.tdmin_2l')->text();
$second = $crawler->filter('.tabcol:nth-child(2) a')->attr('title');
$third  = $crawler->filter('.tabcol:nth-child(2) a')->attr('href');
$fourth = $crawler->filter('.tabcol:nth-child(4)')->text();

var_dump($first, $second, $third, $fourth);
// Outputs:
// string(10) "FIRST+DATA"
// string(11) "SECOND+DATA"
// string(10) "THIRD+DATA"
// string(11) "FOURTH+DATA"

更简单，更干净，对吧？

使用这样的解析器，您也可以使用XPath提取元素。

Answer 2

试试这个：

$str = <<<HTML
<tr class="tabrow ">
<td class="tabcol  tdmin_2l">FIRST+DATA</td>
<td class="tabcol"><a class="modal-button" title="SECOND+DATA"  href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">asdxxx</a></td>
<td class="tabcol"></td>
<td class="tabcol">FOURTH+DATA</td>
</tr>
HTML;

preg_match_all('/<td[^>]*>(.*?)<\/td>/im', $str, $td_matches);
preg_match('/ title="([^"]*)"/i', $td_matches[1][1], $title);
preg_match('/ href="([^"]*)"/i', $td_matches[1][1], $href);

echo $td_matches[1][0] . "\n";
echo $title[1] . "\n";
echo $href[1] . "\n";
echo $td_matches[1][3];

编写多个正则表达式模式来解析HTML

2 个答案: