Help with a PHP regex using negative lookbehind

Asked: 2009-08-03 22:24:12

Tags: php html xml regex negative-lookbehind

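I'm trying to write a simple function that closes missing HTML tags using PHP's preg_replace.

I thought this would be relatively straightforward, but for some reason it hasn't been.

What I'm basically trying to do is close the missing tag in the following row: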

<tr>
<th class="ProfileIndent0">
<p>Global pharmaceuticals</p>
<td>197.2</td>
<td>94</td>
</tr>

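The approach I've been taking is to use a negative lookbehind to find opening td tags that are not preceded by a th tag that was opened and properly closed.

For example: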

$text = preg_replace('!<th(\s\S*){0,1}?>(.*)((?<!<\/th>)[\s]*<td>)!U','<th$1>$2</th>',$text);

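I've written the regex pattern in countless different ways, to no avail. The problem is that I can't seem to match just the one opening td that is missing a </th> before it; instead, the pattern seems to match several opening td tags.

Here is the full input text: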

<CO_TEXT text_type_id="6">
        <TEXT_DATA><![CDATA[<table class="ProfileChart"> <tr> <th class="TableHead" colspan="21">2008 Sales</th> </tr>

<tr> <th class="ProfileIndent0"></th> <th class="ProfileHead">$ mil.</th> <th class="ProfileHead">% of total</th> </tr>

<tr> <th class="ProfileIndent0"> <p>Global pharmaceuticals</p> <td>197.2</td> <td>94</td> </tr>

<tr> <th class="ProfileIndent0">Impax pharmaceuticals</th> <td>12.9</td> <td>6</td> </tr>

<tr> <th class="ProfileTotal">Total</th> <td class="ProfileDataTotal">210.1</td> <td class="ProfileDataTotal">100</td> </tr> </table><h3>Selected Generic Products</h3><ul class="prodoplist"><li>Anagrelide hydrochloride (generic Agrylin, thrombocytosis)</li><li>Bupropion hydr ochloride (generic Wellbutrin SR, depression)</li><li>Colestipol hydrochloride (generic Colestid, high cholesterol)</li><li>Dantrolene sodium (generic  Dantrium, spasticity)</li><li>Metformin Hcl (generic Glucophage XR, diabetes)</li><li>Nadolol/Bendroflumethiazide (generic Corzide, hypertension)</li
><li>Oxybutynin chloride (generic Ditropan XL, urinary incontinence, with Teva)</li><li>Oxycodone hydrochloride (generic OxyContin controlled release,  pain)</li><li>Pilocarpine hydrochlorine (generic Salagen, dry mouth caused by radiation therapy)</li></ul>]]></TEXT_DATA> </CO_TEXT>

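Is there something about negative lookbehind in PHP that I'm not aware of, or have I simply not hit on the right pattern?

Any help would be greatly appreciated.

Thanks, John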

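4 Answers:

Answer 0 (score: 3)

While writing my comment on your question, I was thinking "surely there is another solution, one that doesn't involve some unmaintainable regex"...

And maybe I've found a way; have a look at DOMDocument::loadHTML and DOMDocument::saveHTML.

The first manual page says (quoting):

  Unlike loading XML, HTML does not have to be well-formed to load.

The second one says:

  Creates an HTML document from the DOM representation.

Trying this with the sample of invalid HTML you provided gives this example: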

$str = <<<STRING
<tr>
<th class="ProfileIndent0">
<p>Global pharmaceuticals</p>
<td>197.2</td>
<td>94</td>
</tr>
STRING;

$doc = new DOMDocument();
$doc->loadHTML($str);
echo $doc->saveHTML();

And when I run it (from the command line, to avoid the hassle of escaping the HTML so that it displays properly), I get:

$ php ./temp.php
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><tr>
<th class="ProfileIndent0">
<p>Global pharmaceuticals</p>
</th>
<td>197.2</td>
<td>94</td>
</tr></body></html>

Which, reformatted, gives:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
    "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
    <body>
        <tr>
            <th class="ProfileIndent0">
                <p>Global pharmaceuticals</p>
            </th>
            <td>197.2</td>
            <td>94</td>
        </tr>
    </body>
</html>

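Still not perfect, I admit (it doesn't add any <table> tag, for instance), but at least the tag is now closed as it should be...

There may be some problems with the DOCTYPE and <html> tags, which you probably don't want. Have a look at some of the comments under the manual pages: they might help you ;-)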
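
One possible way to deal with the extra DOCTYPE and <html> wrapper, as a sketch only: serialize just the children of <body> instead of the whole document. This assumes PHP 5.3.6 or later, where saveHTML() accepts a node argument; the variable names are illustrative.

```php
<?php
// Same invalid fragment as in the example above.
$str = <<<STRING
<tr>
<th class="ProfileIndent0">
<p>Global pharmaceuticals</p>
<td>197.2</td>
<td>94</td>
</tr>
STRING;

$doc = new DOMDocument();
$doc->loadHTML($str);

// loadHTML() wraps the fragment in <html><body>; keep only what is inside <body>.
$body = $doc->getElementsByTagName('body')->item(0);

$fragment = '';
foreach ($body->childNodes as $child) {
    // Passing a node makes saveHTML() serialize that subtree only.
    $fragment .= $doc->saveHTML($child);
}

echo $fragment, "\n";
```

The result should contain the repaired row without the DOCTYPE, <html> and <body> lines.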



Edit after thinking a bit more:

Your "full" example produces some warnings; maybe you could tidy your "HTML" a bit before feeding it to loadHTML...

Warning: DOMDocument::loadHTML(): Tag co_text invalid in Entity, 
    line: 1 in /home/squale/developpement/tests/temp/temp.php on line 18
Warning: DOMDocument::loadHTML(): Tag text_data invalid in Entity, 
    line: 2 in /home/squale/developpement/tests/temp/temp.php on line 18
Warning: DOMDocument::loadHTML(): htmlParseStartTag: invalid element name in Entity, 
    line: 2 in /home/squale/developpement/tests/temp/temp.php on line 18
Warning: DOMDocument::loadHTML(): Unexpected end tag : table in Entity, 
    line: 10 in /home/squale/developpement/tests/temp/temp.php on line 18

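If worst comes to worst, you can mask those errors by calling the error_reporting function before and after the call, or by using the @ operator. I wouldn't generally recommend either: they should only be used in extreme cases, though this may be one ^^

Still, the result doesn't look that bad, actually: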

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
    "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
    <co_text text_type_id="6">
        <text_data>
            <tr>
                <th class="TableHead" colspan="21">2008 Sales</th> 
            </tr>
            <tr>
                <th class="ProfileIndent0"></th> 
                <th class="ProfileHead">$ mil.</th> 
                <th class="ProfileHead">% of total</th> 
            </tr>
            <tr>
                <th class="ProfileIndent0"> <p>Global pharmaceuticals</p> </th>
                <td>197.2</td> 
                <td>94</td> 
            </tr>
            <tr>
                <th class="ProfileIndent0">Impax pharmaceuticals</th> 
                <td>12.9</td> 
                <td>6</td> 
            </tr>
            <tr>
                <th class="ProfileTotal">Total</th> 
                <td class="ProfileDataTotal">210.1</td> 
                <td class="ProfileDataTotal">100</td> 
            </tr>
            <h3>Selected Generic Products</h3>
            <ul class="prodoplist">
                <li>Anagrelide hydrochloride (generic Agrylin, thrombocytosis)</li>
                <li>Bupropion hydr ochloride (generic Wellbutrin SR, depression)</li>
                <li>Colestipol hydrochloride (generic Colestid, high cholesterol)</li>
                <li>Dantrolene sodium (generic  Dantrium, spasticity)</li>
                <li>Metformin Hcl (generic Glucophage XR, diabetes)</li>
                <li>Nadolol/Bendroflumethiazide (generic Corzide, hypertension)</li>
                <li>Oxybutynin chloride (generic Ditropan XL, urinary incontinence, with Teva)</li>
                <li>Oxycodone hydrochloride (generic OxyContin controlled release,  pain)</li>
                <li>Pilocarpine hydrochlorine (generic Salagen, dry mouth caused by radiation therapy)</li>
            </ul>
        ]]&gt;
        </text_data>
    </co_text>
</body>
</html>
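
As an alternative to @ or error_reporting for the warnings shown earlier, libxml can buffer its own parse errors instead of raising PHP warnings. A minimal sketch, assuming a shortened stand-in for the full input (the $html string and the $messages array are made up for illustration):

```php
<?php
// Shortened stand-in for the full input, with the same unknown wrapper tags.
$html = '<CO_TEXT text_type_id="6"><TEXT_DATA><table><tr>'
      . '<th class="ProfileTotal">Total<td>210.1</td></tr></table>'
      . '</TEXT_DATA></CO_TEXT>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Keep the messages so they can be logged or inspected rather than lost.
$messages = array();
foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
}
libxml_clear_errors();

// Restore whatever setting was active before.
libxml_use_internal_errors($previous);

$output = $doc->saveHTML();
echo $output;
```

Unlike @, this keeps the parser messages available, so nothing is silently discarded.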


All in all, as others have already suggested, a real HTML tidier/purifier might help ;-)

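Answer 1 (score: 0)

  The problem is that I can't seem to match just the one opening td that is missing a </th> before it; instead, the pattern seems to match several opening td tags.

It sounds like you want "non-greedy" or "lazy" matching. Use '*?' or '+?' instead of '*' or '+', and the expression will match as few characters as possible rather than as many as possible.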
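
To make the difference concrete, here is a small sketch comparing the two quantifiers on a made-up row (the $row string is an assumption, not the asker's data):

```php
<?php
$row = '<th>A</th> <td>1</td> <td>2</td>';

// Greedy: .* grabs as much as possible, so it runs up to the LAST <td>.
preg_match('~<th>(.*)<td>~', $row, $greedy);

// Lazy: .*? grabs as little as possible, so it stops at the FIRST <td>.
preg_match('~<th>(.*?)<td>~', $row, $lazy);

var_dump($greedy[1]); // string(18) "A</th> <td>1</td> "
var_dump($lazy[1]);   // string(7) "A</th> "
```

Note that the U modifier used in the question's pattern inverts the default, so a plain '.*' there already behaves lazily.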

Answer 2 (score: 0)

You could also use something like HTML Tidy or HTML Purifier to fix the HTML automatically.
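
As a rough sketch of the Tidy route, assuming PHP's tidy extension is installed (it is not always enabled by default); the input string and the chosen options are illustrative only:

```php
<?php
// Illustrative input: a table row with an unclosed <th>.
$html = '<table><tr><th class="ProfileIndent0"><p>Global pharmaceuticals</p>'
      . '<td>197.2</td><td>94</td></tr></table>';

$fixed = null;
if (extension_loaded('tidy')) {
    $config = array(
        'show-body-only' => true, // return the fragment, not a full document
        'wrap'           => 0,    // do not re-wrap long lines
    );
    // tidy_repair_string() parses the markup and returns a repaired copy.
    $fixed = tidy_repair_string($html, $config, 'utf8');
}

echo $fixed !== null ? $fixed : "tidy extension not available\n";
```

HTML Purifier is a separate library with its own API and is not shown here.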

Answer 3 (score: 0)

This regex worked for me:

$text = preg_replace('@<th([^>]*)>(.*<\/td>)(<\/th>)?@','<th$1>$2</th>',$text);

Note that it only works when the row is on a single line. I mean, it works for:

<tr><th><td>some</td></tr>

but not for:

<tr><th>
<td>some</td>
</tr>

I don't really know how to use the "s" modifier. If someone could explain it to me, I'd appreciate it.
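
For reference, the "s" modifier (PCRE_DOTALL) makes the dot match newlines as well; without it, '.' stops at a line break. A minimal sketch using the multi-line row from above (variable names are illustrative):

```php
<?php
$html = "<tr><th>\n<td>some</td>\n</tr>";

// Without "s": the dot cannot cross the newline after <th>, so nothing matches.
$withoutS = preg_match('@<th>.*</td>@', $html);

// With "s": the dot also matches "\n", so the pattern spans both lines.
$withS = preg_match('@<th>.*</td>@s', $html);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
```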

Here is my example:

<?php
$html = '<CO_TEXT text_type_id="6">
        <TEXT_DATA><![CDATA[<table class="ProfileChart"> <tr> <th class="TableHead" colspan="21">2008 Sales</th> </tr>

<tr> <th class="ProfileIndent0"></th> <th class="ProfileHead">$ mil.</th> <th class="ProfileHead">% of total</th> </tr>

<tr> <th class="ProfileIndent0"> <p>Global pharmaceuticals</p> <td>197.2</td> <td>94</td> </tr>

<tr> <th class="ProfileIndent0">Impax pharmaceuticals</th> <td>12.9</td> <td>6</td> </tr>

<tr> <th class="ProfileTotal">Total</th> <td class="ProfileDataTotal">210.1</td> <td class="ProfileDataTotal">100</td> </tr> </table><h3>Selected Generic Products</h3><ul class="prodoplist"><li>Anagrelide hydrochloride (generic Agrylin, thrombocytosis)</li><li>Bupropion hydr ochloride (generic Wellbutrin SR, depression)</li><li>Colestipol hydrochloride (generic Colestid, high cholesterol)</li><li>Dantrolene sodium (generic  Dantrium, spasticity)</li><li>Metformin Hcl (generic Glucophage XR, diabetes)</li><li>Nadolol/Bendroflumethiazide (generic Corzide, hypertension)</li
><li>Oxybutynin chloride (generic Ditropan XL, urinary incontinence, with Teva)</li><li>Oxycodone hydrochloride (generic OxyContin controlled release,  pain)</li><li>Pilocarpine hydrochlorine (generic Salagen, dry mouth caused by radiation therapy)</li></ul>]]></TEXT_DATA> </CO_TEXT>';

$text = preg_replace('@<th([^>]*)>(.*<\/td>)(<\/th>)?@s','<th$1>$2</th>',$html);
echo $text;
?>

Output:

<CO_TEXT text_type_id="6">
        <TEXT_DATA><![CDATA[<table class="ProfileChart"> <tr> <th class="TableHead" colspan="21">2008 Sales</th> </tr>

<tr> <th class="ProfileIndent0"></th> <th class="ProfileHead">$ mil.</th> <th class="ProfileHead">% of total</th> </tr>

<tr> <th class="ProfileIndent0"> <p>Global pharmaceuticals</p> <td>197.2</td> <td>94</td> </tr>

<tr> <th class="ProfileIndent0">Impax pharmaceuticals</th> <td>12.9</td> <td>6</td> </tr>

<tr> <th class="ProfileTotal">Total</th> <td class="ProfileDataTotal">210.1</td> <td class="ProfileDataTotal">100</td></th> </tr> </table><h3>Selected Generic Products</h3><ul class="prodoplist"><li>Anagrelide hydrochloride (generic Agrylin, thrombocytosis)</li><li>Bupropion hydr ochloride (generic Wellbutrin SR, depression)</li><li>Colestipol hydrochloride (generic Colestid, high cholesterol)</li><li>Dantrolene sodium (generic  Dantrium, spasticity)</li><li>Metformin Hcl (generic Glucophage XR, diabetes)</li><li>Nadolol/Bendroflumethiazide (generic Corzide, hypertension)</li
><li>Oxybutynin chloride (generic Ditropan XL, urinary incontinence, with Teva)</li><li>Oxycodone hydrochloride (generic OxyContin controlled release,  pain)</li><li>Pilocarpine hydrochlorine (generic Salagen, dry mouth caused by radiation therapy)</li></ul>]]></TEXT_DATA> </CO_TEXT>
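
In the output above, the greedy '.*' combined with the "s" modifier runs from the first <th> to the last </td>, so the single </th> lands in the final row and the "Global pharmaceuticals" row stays unchanged. As a rough sketch only (not this answer's method), a pattern can refuse to step over another cell tag and add </th> only where a <td> follows an unclosed <th>. The sample string and variable names are assumptions chosen for illustration:

```php
<?php
// Illustrative input: the first row is fine, the second has an unclosed <th>.
$html = '<tr> <th class="a">Impax</th> <td>12.9</td> </tr>' . "\n"
      . '<tr> <th class="b"> <p>Global</p> <td>197.2</td> <td>94</td> </tr>';

// (?:(?!</th>|<t[dh]\b|</tr>).)*?  consumes one character at a time, but never
// steps onto a </th>, the start of another cell, or the end of the row.
// (?=\s*<td\b) then requires that an opening <td> comes next.
$pattern = '~<th\b([^>]*)>((?:(?!</th>|<t[dh]\b|</tr>).)*?)(?=\s*<td\b)~s';

$fixed = preg_replace($pattern, '<th$1>$2</th>', $html);
echo $fixed, "\n";
```

This still treats HTML as flat text, so nested tables or unusual markup can defeat it; a parser-based approach is generally more robust.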