How can I replace glossary terms in HTML text with links?

Date: 2012-02-20 09:43:33

Tags: php

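I'd like to run a str_replace or preg_replace that looks for certain words (listed in $glossary_terms) in my $content and replaces them with links (like <a href="/glossary/initial/term">term</a>).

However, $content is full HTML, so my links and images are affected too, which is not what I'm after.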

An example of $content is:

<div id="attachment_542" class="wp-caption alignleft" style="width: 135px"><a href="http://www.seriouslyfish.com/dev/wp-content/uploads/2011/12/Amazonas-English-1.jpg"><img class="size-thumbnail wp-image-542" title="Amazonas English" src="http://www.seriouslyfish.com/dev/wp-content/uploads/2011/12/Amazonas-English-1-288x381.jpg" alt="Amazonas English" width="125" height="165" /></a><p class="wp-caption-text">Amazonas Magazine - now in English!</p></div> <p>Edited by Hans-Georg Evers, the magazine &#8216;Amazonas&#8217; has been widely-regarded as among the finest regular publications in the hobby since its launch in 2005, an impressive achievment considering it&#8217;s only been published in German to date. The long-awaited English version is just about to launch, and we think a subscription should be top of any serious fishkeeper&#8217;s Xmas list&#8230;</p> <p>The magazine is published in a bi-monthly basis and the English version launches with the January/February 2012 issue with distributors already organised in the United States, Canada, the United Kingdom, South Africa, Australia, and New Zealand. There are also mobile apps availablen which allow digital subscribers to read on portable devices.</p> <p>It&#8217;s fair to say that there currently exists no better publication for dedicated hobbyists with each issue featuring cutting-edge articles on fishes, invertebrates, aquatic plants, field trips to tropical destinations plus the latest in husbandry and breeding breakthroughs by expert aquarists, all accompanied by excellent photography throughout.</p> <p>U.S. residents can subscribe to the printed edition for just $29 USD per year, which also includes a free digital subscription, with the same offer available to Canadian readers for $41 USD or overseas subscribers for $49 USD.
Please see the <a href="http://www.amazonasmagazine.com/">Amazonas website</a> for further information and a sample digital issue!</p> <p>Alternatively, subscribe directly to the print version <a href="https://www.amazonascustomerservice.com/subscribe/index2.php">here</a> or digital version <a href="https://www.amazonascustomerservice.com/subscribe/digital.php">here</a>. Just gonna add this to the end of the post so I can do some testing.</p>

I came across this link, but I'm not sure whether that approach works with nested HTML.

Is there any way I can str_replace or preg_replace only the contents of <p> tags, excluding any nested <a> and <img> tags?

Thanks in advance,

2 Answers:

Answer 0 (score: 1)

The "by the book" solution would be something like this:

<?php

$html = "<your HTML string>";
$glossary_terms = array('fishes', 'invertebrates', 'aquatic plants');

$dom = new DOMDocument;
$dom->loadHTML($html);

dom_link_glossary($dom, $glossary_terms);

echo $dom->saveHTML();

// wraps all occurrences of the glossary terms in links
function dom_link_glossary(&$document, &$glossary) {
  $xpath   = new DOMXPath($document);
  $urls    = array();
  $pattern = array();

  // build a normalized lookup (case-insensitive, whitespace-agnostic)
  foreach ($glossary as $term) {
    $term_norm = preg_replace('/\s+/', ' ', strtoupper(trim($term)));
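    // escape regex metacharacters (including the "/" delimiter), then let
    // each space in the term match any run of whitespace
    $pattern[] = preg_replace('/ /', '\\s+', preg_quote($term_norm, '/'));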
    $urls[$term_norm] = '/glossary/initial/' . rawurlencode($term);
  }

  $pattern  = '/\b(' . implode('|', $pattern) . ')\b/i';
  $text_nodes = $xpath->query('//text()[not(ancestor::a)]');

  foreach($text_nodes as $original_node) {
    $text     = $original_node->nodeValue;
    $hitcount = preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

    if ($hitcount == 0) continue;

    $offset   = 0;
    $parent   = $original_node->parentNode;
    $refnode  = $original_node->nextSibling;

    $parent->removeChild($original_node);

    foreach ($matches[0] as $i => $match) {
      $term_txt = $match[0];
      $term_pos = $match[1];
      $term_norm = preg_replace('/\s+/', ' ', strtoupper($term_txt));

      // insert any text before the term instance
      $prefix = substr($text, $offset, $term_pos - $offset);
      $parent->insertBefore($document->createTextNode($prefix), $refnode);

      // insert the actual term instance as a link
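      // use a text node child so characters such as "&" are escaped properly
      $link = $document->createElement("a");
      $link->appendChild($document->createTextNode($term_txt));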
      $link->setAttribute("href", $urls[$term_norm]);
      $parent->insertBefore($link, $refnode);

      $offset = $term_pos + strlen($term_txt);

      if ($i == $hitcount - 1) {  // last match, append remaining text
        $suffix = substr($text, $offset);
        $parent->insertBefore($document->createTextNode($suffix), $refnode);
      }
    }
  }
}
?>

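Here's how dom_link_glossary() works:

  • It normalizes the glossary terms (trim, uppercase, collapse whitespace) and builds a lookup array plus a regex pattern that matches all terms. I use \b to prevent partial-word matches.
  • It uses XPath to find all text nodes that are not already part of a link. Text nodes are returned regardless of their nesting depth (i.e. we don't need to recurse).
  • For every text node that contains a term:
    • The original text node is removed ($parent->removeChild()).
    • New nodes are created and inserted into the DOM: text nodes for anything before (or after) a glossary term, element nodes (<a>) for the actual glossary terms.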
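
To make the XPath step concrete, here is a minimal sketch that only lists the text nodes a query selects. The helper name glossary_text_nodes() and the sample markup are made up for illustration. The second query, restricted to descendants of <p>, is an assumption about how one might limit linking to paragraph content, as the question asks.

```php
<?php
// Hypothetical helper (not part of the answer's code): returns the values of
// all text nodes selected by an XPath query, in document order.
function glossary_text_nodes(string $html, string $query): array {
    libxml_use_internal_errors(true);   // collect parser warnings silently
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $xpath  = new DOMXPath($dom);
    $values = array();
    foreach ($xpath->query($query) as $node) {
        $values[] = $node->nodeValue;
    }
    return $values;
}

// Made-up sample: one term inside a link, one nested inside <em>.
$sample = '<div>Intro <a href="/x">fishes</a></div>'
        . '<p>Articles on <em>aquatic plants</em> and <a href="/y">fishes</a>.</p>';

// Every text node that is not inside a link, at any nesting depth.
$all_nodes = glossary_text_nodes($sample, '//text()[not(ancestor::a)]');

// Variation: only text nodes that sit somewhere inside a <p>.
$p_nodes = glossary_text_nodes($sample, '//p//text()[not(ancestor::a)]');

print_r($all_nodes);
print_r($p_nodes);
```

Neither list contains the link text "fishes", while "aquatic plants" is found even though it is nested inside <em>. Swapping the query passed to $xpath->query() in dom_link_glossary() for the second one should be enough to leave text outside paragraphs alone.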

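The solution preserves the original case and whitespace, so

  • term will become <a href="/glossary/initial/term">term</a>
  • Term will become <a href="/glossary/initial/term">Term</a>
  • Foo Bar will become <a href="/glossary/initial/foo%20bar">Foo Bar</a>. Extra spaces or line breaks in the HTML do not break the mechanism.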
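
The reason case and whitespace survive is that the matched text is used verbatim for the link body, and only the lookup key is normalized. A small sketch of that idea follows. The helper normalize_term() is a hypothetical name that mirrors the normalization in dom_link_glossary(), and the glossary entry is assumed to be written as 'foo bar'.

```php
<?php
// Hypothetical helper mirroring the answer's normalization:
// trim, uppercase, collapse whitespace runs to a single space.
function normalize_term(string $text): string {
    return preg_replace('/\s+/', ' ', strtoupper(trim($text)));
}

$term = 'foo bar';                                  // glossary entry as written
$key  = normalize_term($term);                      // lookup key
$urls = array($key => '/glossary/initial/' . rawurlencode($term));

// Each space in the term may match any run of whitespace in the text.
$pattern = '/\b(' . str_replace(' ', '\s+', preg_quote($key, '/')) . ')\b/i';

$text = "Some Foo\n   Bar here";                    // mixed case, line break
preg_match($pattern, $text, $m);

$matched = $m[0];                                   // text exactly as it appears
$href    = $urls[normalize_term($matched)];         // URL found via the normalized key

echo $href, "\n";
```

Note that the URL comes from the glossary entry as written in the array, not from the matched text, which is why both term and Term point at the same address.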

Note that using regex on plain text node values is perfectly fine. Using regex on full HTML is not.
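
As an illustration of what goes wrong on full HTML, the sketch below runs a naive str_replace over a fragment resembling the question's markup. The glossary term 'Amazonas' and the file name are assumptions chosen only to trigger the problem.

```php
<?php
// Made-up fragment: the term appears in an attribute and in paragraph text.
$html = '<img title="Amazonas English" src="cover.jpg">'
      . '<p>Amazonas is published bi-monthly.</p>';

$link = '<a href="/glossary/initial/amazonas">Amazonas</a>';

// Naive replacement on the raw HTML string: it cannot tell attribute
// values from text content, so both occurrences are rewritten.
$naive = str_replace('Amazonas', $link, $html);

echo $naive, "\n";
```

The title attribute now contains an <a> tag, which breaks the image markup. This is the same kind of damage to links and images the question describes, and operating on text nodes avoids it.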

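I'd suggest pairing the glossary terms with their respective URLs in an array rather than computing the URL inside the function. That way you could point multiple terms at the same URL.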
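
One possible shape for that pairing is sketched below, assuming an associative array of term => URL. The function name build_glossary_lookup() and the URLs are hypothetical. The function returns the same two things the loop at the top of dom_link_glossary() builds: a combined pattern and a normalized lookup table.

```php
<?php
// Hypothetical sketch: build the regex pattern and URL lookup from a
// term => URL map instead of deriving each URL from the term.
function build_glossary_lookup(array $term_urls): array {
    $urls         = array();
    $alternatives = array();
    foreach ($term_urls as $term => $url) {
        $term_norm      = preg_replace('/\s+/', ' ', strtoupper(trim($term)));
        $alternatives[] = str_replace(' ', '\s+', preg_quote($term_norm, '/'));
        $urls[$term_norm] = $url;
    }
    return array('/\b(' . implode('|', $alternatives) . ')\b/i', $urls);
}

// Two terms deliberately share one URL (all URLs here are made up).
$term_urls = array(
    'fishes'         => '/glossary/f/fish',
    'fish'           => '/glossary/f/fish',
    'aquatic plants' => '/glossary/a/aquatic-plants',
);

list($pattern, $urls) = build_glossary_lookup($term_urls);

preg_match_all($pattern, 'Fish and fishes need aquatic  plants.', $matches);

$hrefs = array();
foreach ($matches[0] as $hit) {
    // same normalization as for the lookup keys
    $hrefs[] = $urls[preg_replace('/\s+/', ' ', strtoupper($hit))];
}

print_r($hrefs);
```

With a map like this, singular and plural forms (or synonyms) can resolve to a single glossary page without any changes to the DOM-walking part of the function.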

Answer 1 (score: 0)

You could try this:

$content = preg_replace('/(<p\sclass=\"wp\-caption\-text\">)[^<]+(<\/p>)/i', '', $content);