Question

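I have a large amount of partial HTML stored in a CMS database.

I'm looking for a way to go through the HTML, find any <a></a> tags that don't have a title, and add a title to them based on the contents of the tag.

So if I have <a href="somepage">some text</a>, I'd like to modify the tag to read: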

<a title="some text" href="somepage"></a>

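Some tags already have a title, and some anchor tags have nothing between them.

So far I've managed to make some progress with PHP and regex.

But I can't seem to get the contents of the anchor; it just displays either a 1 or a 0.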

<?php
$file = "test.txt";
$handle = fopen("$file", "r");
$theData = fread($handle, filesize($file));
$line = explode("\r\n", $theData);

$regex = '/^.*<a ((?!title).)*$/'; //finds all lines that don't contain an anchor with a title
$regex2 = '/<a .*><\/a>/'; //finds all lines that have nothing between the anchors
$regex3 = '/<a.*?>(.+?)<\/a>/'; //finds the contents of the anchors

foreach ($line as $lines)
{
  if (!preg_match($regex2, $lines) && preg_match($regex, $lines)){
    $tags = $lines;
    $contents = preg_match($regex3, $tags);
    $replaced = str_replace("<a ", "<a title=\"$contents\" ", $lines);
    echo $replaced ."\r\n";
  }
  else {
  echo $lines. "\r\n";
  }
}
?>

I understand that regex may not be the best way to parse HTML, so any help or alternative suggestions would be greatly appreciated.

Answer 1

Use PHP's built-in DOM parsing. It is far more reliable than regex. Note that loading HTML into the PHP DOM will normalize it.

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress parsing errors with @

$links = $doc->getElementsByTagName('a');
foreach ($links as $link) {
    if ($link->getAttribute('title') == '') {
        $link->setAttribute('title', $link->nodeValue);
    }
}
$html = $doc->saveHTML();
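
Because the question deals with partial HTML, one thing to watch is that calling saveHTML() on the whole document adds a DOCTYPE and <html>/<body> wrappers around a fragment. The following is a minimal sketch of one way around that, not part of the answer above: the function name addMissingTitles, the wrapper <div>, and the assumption that the stored fragments are UTF-8 are all illustrative choices. It also skips anchors with no text, which the question mentions exist.

```php
<?php
// Hypothetical helper: adds a title to each <a> that lacks one and
// returns the fragment without DOCTYPE/<html>/<body> wrappers.
function addMissingTitles(string $fragment): string
{
    // Collect parse warnings quietly instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Wrap the fragment in a single known root element. The leading XML
    // declaration is a common way to hint that the input is UTF-8 (assumption).
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $fragment . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $link) {
        $text = trim($link->textContent);
        // Leave anchors alone if they already have a title or contain no text.
        if ($text !== '' && !$link->hasAttribute('title')) {
            $link->setAttribute('title', $text);
        }
    }

    // The wrapper is the first <div> in document order; serialize only its children.
    $root = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Called as addMissingTitles($html) on each stored fragment, this should return the fragment with titles filled in. The markup may still be normalized by the parser, as noted above.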

Answer 2

Never use regex to parse HTML. In PHP, use the DOM.

Here is a simpler option: http://simplehtmldom.sourceforge.net/

Answer 3

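If the markup is consistent, you can use a simple regular expression. It will fail, however, if your anchors have classes or any other attributes. It also doesn't correctly encode the title= attribute: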

$html = preg_replace('#<(a\s+href="[^"]+")>([^<>]+)</a>#ims', '<$1 title="$2">$2</a>', $html);

Therefore, phpQuery / querypath is probably the more robust approach:

$html = phpQuery::newDocument($html);
foreach ($html->find("a") as $a) {
    if (empty($a->attr("title"))) {
         $a->attr("title", $a->text());
    }
}
print $html->getDocument();
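
Returning to the regular-expression variant at the start of this answer: its missing attribute encoding could be addressed with preg_replace_callback() and htmlspecialchars(). This is only a sketch under the same narrow assumption (anchors with nothing but an href attribute), and the function name addTitlesWithRegex is made up for illustration. It also shows where the captured text lives: preg_match() itself returns only 1 or 0, which is what the question observed, while the captures arrive in a separate matches array (here the callback's $m).

```php
<?php
// Hypothetical helper: same narrow pattern as the regex above, but the
// link text is escaped before it is placed inside the title attribute.
function addTitlesWithRegex(string $html): string
{
    $result = preg_replace_callback(
        '#<(a\s+href="[^"]+")>([^<>]+)</a>#i',
        function (array $m): string {
            // $m[1] is the `a href="..."` part, $m[2] is the link text.
            // double_encode = false keeps existing entities such as &amp; intact.
            $title = htmlspecialchars($m[2], ENT_QUOTES, 'UTF-8', false);
            return '<' . $m[1] . ' title="' . $title . '">' . $m[2] . '</a>';
        },
        $html
    );
    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}
```

Anchors that already carry a title, a class, or any other attribute do not match the pattern and are left unchanged, so the limitation described above still applies.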

Parsing HTML and replacing strings
