Question

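I've asked about this "filter" in a few different guises, both here and on WPSE. I'm now taking a different approach, and I want to make it solid and reliable.

My situation:

When I create a post in the WordPress CMS, I want to run a filter that searches for specific terms and replaces them with links.
I have two arrays of terms to search for: $glossary_terms and $species_terms.
$species_terms is a list of scientific names of fish, e.g. Apistogramma panduro.
$glossary_terms is a list of fishkeeping glossary terms, e.g. abdomen, caudal-fin and Gram's Method.

A few nuances worth noting:

Speed is not an issue, because I'll run this filter in the background rather than when a user visits the page or when an author submits/edits a species profile or post.
Some of the content being filtered may contain HTML with these terms in it, e.g. <img src="image.jpg" title="Apistogramma panduro male" />. Obviously these should not be replaced.
Species are commonly referred to with an abbreviated genus, so instead of Apistogramma panduro you will often see A. panduro. This means I need to search & replace all species terms in their abbreviated forms too: Apistogramma panduro, A. panduro, Satanoperca daemon, S. daemon, etc.
If both caudal-fin and caudal exist in the glossary, caudal-fin should be replaced first.

I was considering simply doing a preg_replace with added search conditions, allowing only a space on the left of the term (i.e. ( )term) and a space, comma, exclamation mark, full stop or hyphen on the right (i.e. term(, . ! - )), but that won't help me avoid breaking the image HTML.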

Example content

<br />
It looks very similar to fishes of the <i><a href="species/betta-foerschi" rel="species/betta-foerschi/?hover=true" class="link_species">B. foerschi</a></i> group/complex but its breeding strategy, adult size and observed behaviour preclude its inclusion in that <a href="glossary/a/assemblage" rel="glossary/a/assemblage?hover=true" class="link_glossary">assemblage</a>.

Instead it appears to be a member of the <i><a href="species/betta-coccina" rel="species/betta-coccina/?hover=true" class="link_species">B. coccina</a></i> group which currently includes <i><a href="species/betta-brownorum" rel="species/betta-brownorum/?hover=true" class="link_species">B. brownorum</a></i>, <i><a href="species/betta-burdigala" rel="species/betta-burdigala/?hover=true" class="link_species">B. burdigala</a></i>, <i><a href="species/betta-coccina" rel="species/betta-coccina/?hover=true" class="link_species">B. coccina</a></i>, <i><a href="species/betta-livida" rel="species/betta-livida/?hover=true" class="link_species">B. livida</a></i>, <i>B. miniopinna</i>, <i><a href="species/betta-persephone" rel="species/betta-persephone/?hover=true" class="link_species">B. persephone</a></i>, <i>B. tussyae</i>, <i><a href="species/betta-rutilans" rel="species/betta-rutilans/?hover=true" class="link_species">B. rutilans</a></i> and <i><a href="species/betta-uberis" rel="species/betta-uberis/?hover=true" class="link_species">B. uberis</a></i>.

Of these it's most similar in appearance to <i><a href="species/betta-uberis" rel="species/betta-uberis/?hover=true" class="link_species">B. uberis</a></i> but can be distinguished by its noticeably shorter <a href="glossary/d/dorsal" rel="glossary/d/dorsal?hover=true" class="link_glossary">dorsal</a>-<a href="glossary/f/fin" rel="glossary/f/fin?hover=true" class="link_glossary">fin</a> <a href="glossary/b/base" rel="glossary/b/base?hover=true" class="link_glossary">base</a> and overall blue-greenish (vs. green/reddish) colouration.

Members of this group are characterised by their small adult size (&lt; 40 mm SL), a uniform red or black <a href="glossary/b/base" rel="glossary/b/base?hover=true" class="link_glossary">base</a> body colour, the presence of a <a href="glossary/m/midlateral" rel="glossary/m/midlateral?hover=true" class="link_glossary">midlateral</a> body blotch in some <a href="glossary/s/species" rel="glossary/s/species?hover=true" class="link_glossary">species</a> and the fact they have 9 abdominal <a href="glossary/v/vertebrae" rel="glossary/v/vertebrae?hover=true" class="link_glossary">vertebrae</a> compared with 10-12 in the other <a href="glossary/s/species" rel="glossary/s/species?hover=true" class="link_glossary">species</a> groups. In addition all are <a href="glossary/o/obligate" rel="glossary/o/obligate?hover=true" class="link_glossary">obligate</a> <a href="glossary/p/peat" rel="glossary/p/peat?hover=true" class="link_glossary">peat</a> <a href="glossary/s/swamp" rel="glossary/s/swamp?hover=true" class="link_glossary">swamp</a> dwellers (Tan and Ng, 2005).<br />

^^^ This example has the correct links inserted manually. The filter should not break these links!

It looks very similar to fishes of the B. foerschi group/complex but its breeding strategy, adult size and observed behaviour preclude its inclusion in that assemblage.

Instead it appears to be a member of the B. coccina group which currently includes B. brownorum, B. burdigala, B. coccina, B. livida, B. miniopinna, B. persephone, B. tussyae, B. rutilans and B. uberis.

Of these it's most similar in appearance to B. uberis but can be distinguished by its noticeably shorter dorsal-fin base and overall blue-greenish (vs. green/reddish) colouration.

Members of this group are characterised by their small adult size (< 40 mm SL), a uniform red or black base body colour, the presence of a midlateral body blotch in some species and the fact they have 9 abdominal vertebrae compared with 10-12 in the other species groups. In addition all are obligate peat swamp dwellers (Tan and Ng, 2005).

^^^ The same example before formatting.

[caption id="attachment_542" align="alignleft" width="125" caption="Amazonas Magazine - now in English!"]<a href="http://www.seriouslyfish.comwp-content/uploads/2011/12/Amazonas-English-1.jpg"><img class="size-thumbnail wp-image-542" title="Amazonas English" src="/wp-content/uploads/2011/12/Amazonas-English-1-288x381.jpg" alt="Amazonas English" width="125" height="165" /></a>[/caption]

Edited by Hans-Georg Evers, the magazine 'Amazonas' has been widely-regarded as among the finest regular publications in the hobby since its launch in 2005, an impressive achievment considering it's only been published in German to date. The long-awaited English version is just about to launch, and we think a subscription should be top of any serious fishkeeper's Xmas list...

The magazine is published in a bi-monthly basis and the English version launches with the January/February 2012 issue with distributors already organised in the United States, Canada, the United Kingdom, South Africa, Australia, and New Zealand. There are also mobile apps availablen which allow digital subscribers to read on portable devices.

It's fair to say that there currently exists no better publication for dedicated hobbyists with each issue featuring cutting-edge articles on fishes, invertebrates, aquatic plants, field trips to tropical destinations plus the latest in husbandry and breeding breakthroughs by expert aquarists, all accompanied by excellent photography throughout.

U.S. residents can subscribe to the printed edition for just $29 USD per year, which also includes a free digital subscription, with the same offer available to Canadian readers for $41 USD or overseas subscribers for $49 USD. Please see the <a href="http://www.amazonasmagazine.com/">Amazonas website</a> for further information and a sample digital issue!

Alternatively, subscribe directly to the print version <a href="https://www.amazonascustomerservice.com/subscribe/index2.php">here</a> or digital version <a href="https://www.amazonascustomerservice.com/subscribe/digital.php">here</a>.

^^^ This one would probably get only a few glossary terms, and no species links.

Example terms

$species_terms

339 => 'Aulonocara maylandi maylandi',
340 => 'Aulonocara maylandi kandeensis',
341 => 'Aulonocara sp. "walteri"',
342 => 'Aulonocara sp. "stuartgranti maleri"',
343 => 'Aulonocara stuartgranti',
344 => 'Benthochromis tricoti',
345 => 'Boulengerochromis microlepis',
346 => 'Buccochromis lepturus',
347 => 'Buccochromis nototaenia',
348 => 'Betta brownorum',
349 => 'Betta foerschi',
350 => 'Betta coccina',
351 => 'Betta uberis'

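As you can see above, the general format of these scientific names is "Genus species", but they often include "sp." or "aff." (for species that have not been formally described), and some take the "Genus species subspecies" form.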

$glossary_terms

1 => 'abdomen',
2 => 'caudal',
3 => 'caudal-fin',
4 => 'caudal-fin peduncle',
5 => 'Gram\'s Method'

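If anyone can come up with a filter that satisfies all of these conditions and requirements, I'd like to offer a bounty.

Thanks in advance,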

Answer 1

I think using DOMDocument is a better fit here than using regexps on the raw markup. Here is a working prototype:

// Each dynamically constructed regexp will contain at most 70 subpatterns
define('GROUPS_PER_REGEXPS', 70);

$speciesTerms = array(
  339 => '(?:Aulonocara|A\.) maylandi maylandi',
  340 => '(?:Aulonocara|A\.) maylandi kandeensis',
  344 => '(?:Benthochromis|B\.) tricoti',
  345 => '(?:Boulengerochromis|B\.) microlepis',
);

function matchTerms($text) {
  // Globals are not good. I left it for the simplicity
  global $speciesTerms;

  $result = array();
  $t = 0;
  $speciesCount = count($speciesTerms);
  reset($speciesTerms);
  while ($t < $speciesCount) {
    // Maps capturing group identifiers to term ids
    $termMapping = array();

    // Dynamically construct regexp
    $groups = '';
    $c = 1;
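    // each() was removed in PHP 8; walk the array's internal pointer instead
    while (($termId = key($speciesTerms)) !== null) {
      $termPattern = current($speciesTerms);
      next($speciesTerms);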
      if (!empty($groups)) {
        $groups .= '|';
      }
      // Match word boundaries, so we don't capture "B. tricotisomeramblingstring"
      $groups .= '(\b' . $termPattern . '\b)';
      $termMapping[$c++] = $termId;
      if (++$t % GROUPS_PER_REGEXPS == 0) {
        break;
      }
    }
    $regexp = "/$groups/m";
    preg_match_all($regexp, $text, $matches, PREG_OFFSET_CAPTURE);
    for ($i = 1; $i < $c; $i++) {
      foreach ($matches[$i] as $matchData) {
        // matchData[0] holds matched string, e.g. Benthochromis tricoti
        // matchData[1] holds offset, e.g. 15
        if (isset($matchData[0]) && !empty($matchData[0])) {
          $result[] = array(
            'text' => $matchData[0],
            'offset' => $matchData[1],
            'id' => $termMapping[$i],
          );
        }
      }
    }
  }
  // Sort by offset in descending order
  usort($result, function($a, $b) {
    return $b['offset'] <=> $a['offset'];
  });
  return $result;
}

// loadHTML() is an instance method; calling it statically fails on PHP 8
$doc = new DOMDocument();
$doc->loadHTML($html);

// Stack will be used to avoid recursive functions
$stack = new SplStack;
$stack->push($doc);
while (!$stack->isEmpty()) {
  $node = $stack->pop();
  if ($node->nodeType == XML_TEXT_NODE && $node->parentNode instanceof DOMElement) {
    // $node represents text node
    //  and it's inside a tag (second condition in the statement above)

    // Check that this text is not wrapped in <a> tag
    //  as we don't want to wrap it twice
    if ($node->parentNode->tagName != 'a') {
      $matches = matchTerms($node->wholeText);
      foreach ($matches as $match) {
        // Create new link element in the DOM
        $link = $doc->createElement('a', $match['text']);
        $link->setAttribute('href', 'species/' . $match['id']);
        $link->setAttribute('class', 'link_species');

        // Save the text after the link
        $remainingText = $node->splitText($match['offset'] + strlen($match['text']));
        // Save the text before the link
        $linkText = $node->splitText($match['offset']);

        // Replace $linkText with $link node
        //  i.e. 'something' becomes '<a href="..">something</a>'
        $node->parentNode->replaceChild($link, $linkText);
      }
    }
  }
  if ($node->hasChildNodes()) {
    foreach ($node->childNodes as $childNode) {
      $stack->push($childNode);
    }
  }
}

$body = $doc->getElementsByTagName('body');
echo $doc->saveHTML($body->item(0));

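Implementation details

I've only shown how to replace species terms; glossary terms would work the same way. Links are built in the form "species/$id". Abbreviations are handled correctly. DOMDocument is a very reliable parser: it copes with broken markup and it is fast.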
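The prototype only checks whether the direct parent of a text node is an <a> tag. As a possible alternative, here is a minimal sketch that uses DOMXPath to select only text nodes that have no <a> ancestor at any depth. The sample markup is made up for illustration. Attribute values such as the image title are never text nodes, so they are left alone either way.

```php
<?php
// Illustrative markup: a title attribute, plain text, and an existing link.
$html = '<p><img src="image.jpg" title="Apistogramma panduro male" /> '
      . 'Apistogramma panduro and <a href="species/x">A. panduro</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Select text nodes that are not inside an <a>, however deeply nested.
$xpath = new DOMXPath($doc);
$texts = array();
foreach ($xpath->query('//text()[not(ancestor::a)]') as $textNode) {
    $texts[] = $textNode->nodeValue;
}

// Only the paragraph text outside the link is collected.
print_r($texts);
```

Each collected node could then be passed to matchTerms() and split as in the prototype.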

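In the regexp, ?: prevents a subpattern from being counted as a capturing group (documentation on subpatterns). If the subpatterns aren't counted correctly, we can't retrieve the termId. The idea is to build one big regexp by concatenating all the regexps specified in the $speciesTerms array, separated by the pipe |. The resulting regexp for the first two species (spaced out for clarity):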

       First capturing group             Alternation       Second capturing group
( (?:Aulonocara|A\.) maylandi maylandi )      |       ( (?:Aulonocara|A\.) maylandi kandeensis )

So the text "Example: Aulonocara maylandi maylandi, A. maylandi kandeensis" will give the following matches:

$matches[1] = array('Aulonocara maylandi maylandi') // Captured by the first group
$matches[2] = array('A. maylandi kandeensis') // Captured by the second group

We can say with certainty that every element in $matches[1] refers to the species Aulonocara maylandi maylandi (or A. maylandi maylandi), which has id = 339.
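The group-to-id mapping can be tried in isolation. The sketch below is not the prototype itself: it uses PREG_SET_ORDER instead of the default ordering, and the variable names ($groupToId, $found) are made up for illustration.

```php
<?php
$patterns = array(
    339 => '(?:Aulonocara|A\.) maylandi maylandi',
    340 => '(?:Aulonocara|A\.) maylandi kandeensis',
);

// One capturing group per term; remember which group belongs to which id.
$groups = array();
$groupToId = array();
$n = 1;
foreach ($patterns as $termId => $pattern) {
    $groups[] = '(\b' . $pattern . '\b)';
    $groupToId[$n++] = $termId;
}
$regexp = '/' . implode('|', $groups) . '/';

$text = 'Example: Aulonocara maylandi maylandi, A. maylandi kandeensis';
preg_match_all($regexp, $text, $matches, PREG_SET_ORDER);

$found = array();
foreach ($matches as $match) {
    foreach ($groupToId as $group => $termId) {
        // Only the alternative that actually matched has a non-empty group.
        if (isset($match[$group]) && $match[$group] !== '') {
            $found[] = array('text' => $match[$group], 'id' => $termId);
        }
    }
}
print_r($found);
```

Because the genus alternation is wrapped in (?:), group 1 and group 2 stay aligned with the first and second term.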

In short: if you use subpatterns in $speciesTerms, write them as (?:).
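The $speciesTerms patterns in the prototype were written by hand. A sketch of how they might be generated from the plain names in the question's $species_terms follows; speciesPattern() is a hypothetical helper, not part of the answer.

```php
<?php
// Hypothetical helper: turns a plain scientific name into a pattern
// that also matches the abbreviated genus.
function speciesPattern(string $name): string {
    // Split "Genus rest of the name" at the first space.
    $parts = explode(' ', $name, 2);
    $genus = $parts[0];
    $rest = $parts[1] ?? '';

    // "Aulonocara" becomes (?:Aulonocara|A\.), deliberately non-capturing.
    $pattern = '(?:' . preg_quote($genus, '/') . '|'
             . preg_quote($genus[0] . '.', '/') . ')';
    if ($rest !== '') {
        $pattern .= ' ' . preg_quote($rest, '/');
    }
    return $pattern;
}

$species_terms = array(
    339 => 'Aulonocara maylandi maylandi',
    344 => 'Benthochromis tricoti',
);

// array_map() with a single array keeps the keys, so ids stay attached.
$speciesTerms = array_map('speciesPattern', $species_terms);
print_r($speciesTerms);
```

One caveat: names that end in a non-word character, such as Aulonocara sp. "walteri", would not satisfy the trailing \b that matchTerms() appends, so they would likely need a different boundary check.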

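Update: each dynamically created regexp now has a cap on its number of subpatterns, defined as a constant at the top. This avoids hitting the PCRE limit on the number of subpatterns in a single regexp.

Important notes: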

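If you have a lot of terms, you should rewrite matchTerms, because a regexp can hold only a limited number of subpatterns. In that case it's better to pre-build a set of regexps, one for every N terms.
matchTerms generates the regexps on every call; obviously that only needs to be done once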
Regexp syntax can be used in $speciesTerms
strlen => mb_strlen if you use a multibyte encoding
The supplied $html will be wrapped in a <body> tag (unless it is already wrapped)
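The first two notes could be combined along these lines. This is a sketch only: buildRegexps() and TERMS_PER_REGEXP are hypothetical names, and the chunk size of 2 is chosen just to keep the example small (the prototype uses 70).

```php
<?php
// Assumed chunk size for the demo; the prototype's constant is 70.
const TERMS_PER_REGEXP = 2;

// Hypothetical helper: builds every regexp once, so the matching code
// can reuse the result instead of rebuilding patterns on each call.
function buildRegexps(array $termPatterns, int $perRegexp): array {
    $built = array();
    // preserve_keys = true keeps the term ids as array keys in each chunk.
    foreach (array_chunk($termPatterns, $perRegexp, true) as $chunk) {
        $groups = array();
        $groupToId = array();
        $n = 1;
        foreach ($chunk as $termId => $pattern) {
            $groups[] = '(\b' . $pattern . '\b)';
            $groupToId[$n++] = $termId;
        }
        $built[] = array(
            'regexp' => '/' . implode('|', $groups) . '/',
            'ids' => $groupToId,
        );
    }
    return $built;
}

$regexps = buildRegexps(array(
    339 => '(?:Aulonocara|A\.) maylandi maylandi',
    340 => '(?:Aulonocara|A\.) maylandi kandeensis',
    344 => '(?:Benthochromis|B\.) tricoti',
), TERMS_PER_REGEXP);

print_r($regexps);
```

With three terms and a chunk size of 2 this yields two regexps, each carrying its own group-to-id map. Since speed is not a concern in the question, caching the result in a variable for the duration of the background run is probably enough.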

Answer 2

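You're better off parsing the HTML than trying to do this with regex. Regexes are good when you want to match something specific, but they get awkward when you're trying to not match something.

Using http://simplehtmldom.sourceforge.net/: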

function addLinks(&$p, $species, $terms) {

  // much easier to say "not in an anchor tag" with parsed content than with regex
  if ($p->tag != 'a') {

    // pull out existing elements so they aren't replaced
    $children = array();
    $x = 0;

    foreach ($p->children as &$e) {
      $children[] = $e->outertext;
      $e->outertext = '---child-'.$x.'---';
      $x++;
    }

    foreach($species as $s) {
      $p->innertext = str_replace(
          $s,
          '<a href="species/'.strtolower(str_replace(' ','-',$s)).'">'.$s.'</a>',
          $p->innertext);
    }

    foreach($terms as $t) {
      $p->innertext = str_replace(
          $t,
          '<a href="glossary/'.
              strtolower($t[0]).'/'.
              strtolower(str_replace(' ','-',$t)).'">'.$t.'</a>',
          $p->innertext);
    }

    // restore previous child elements
    foreach($children as $x => $e) {
      $p->innertext = str_replace('---child-'.$x.'---', $e, $p->innertext);
    }

    foreach ($p->children() as &$e) {
      addLinks($e, $species, $terms);
    }
  }
}


$html = new simple_html_dom();

// you may have to wrap $content in a div. not exactly sure how partial content is handled
$html->load($content);

addLinks($html, $species_terms, $glossary_terms);
$content = $html->save();

I haven't used simple_html_dom myself, but this should point you in the right direction.
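One thing to watch with the sequential str_replace calls above: once caudal-fin has been turned into a link, a later pass for caudal would match inside that link again. A possible way around this, sketched below under the assumption that it is applied to plain text only (not to markup), is a single preg_replace_callback pass with the terms sorted longest first. linkGlossary() is a hypothetical helper; the URL scheme is copied from the code above.

```php
<?php
$glossary_terms = array(
    'abdomen', 'caudal', 'caudal-fin', 'caudal-fin peduncle', "Gram's Method",
);

// Longest first, so "caudal-fin peduncle" wins over "caudal-fin" and "caudal".
usort($glossary_terms, function ($a, $b) {
    return strlen($b) <=> strlen($a);
});

$alternatives = array_map(function ($term) {
    return preg_quote($term, '/');
}, $glossary_terms);
$regexp = '/\b(?:' . implode('|', $alternatives) . ')\b/';

// Hypothetical helper: links every glossary term in one pass, so text
// that has already been replaced is never scanned again.
function linkGlossary(string $text, string $regexp): string {
    return preg_replace_callback($regexp, function ($m) {
        $term = $m[0];
        $slug = strtolower(str_replace(' ', '-', $term));
        return '<a href="glossary/' . strtolower($term[0]) . '/' . $slug . '">'
             . $term . '</a>';
    }, $text);
}

echo linkGlossary('The caudal-fin peduncle and the caudal region', $regexp);
```

Here the longer term is linked as a whole and the standalone caudal gets its own link, without any nested anchors.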

Robust and efficient custom search & replace function - preg or str replace
