preg_replace - which of these words is causing the problem?

Time: 2012-03-20 12:51:08

Tags: php regex preg-replace

I am trying to do a "find and replace" of glossary terms on my website.

The terms are taken from my database and built into a simple array of strings:

/* get the glossary terms */
$results = $wpdb->get_results( 'SELECT post_title AS list FROM wp_posts WHERE post_status="publish" AND post_type="glossary" AND post_parent>0' );

$glossary_terms = array();

foreach ( $results as $row ) {
    $term = preg_quote( str_replace( array("/", "'"), array("\/", "&quot;"), $row->list ) );
    $glossary_terms[] = $term;
}

$glossary_terms is then used as $glossary in the following function:

$urls    = array();
$rels    = array();
$title   = array();
$pattern = array();

// build a normalized lookup (case-insensitive, whitespace-agnostic)
foreach ($glossary as $term) {
    $term_norm = preg_replace('/\s+/', ' ', strtoupper(trim($term)));
    $pattern[] = preg_replace('/ /', '\\s+', preg_quote($term_norm));
    $initial = substr($term, 0, 1);
    $urls[$term_norm] = '/dev/glossary/' . $initial . '/' . rawurlencode($term);
    $rels[$term_norm] = '/dev/glossary/' . $initial . '/' . rawurlencode($term) . '?preview=true';
    $title[$term_norm] = $term;
}

$pattern  = '/\b(' . implode('|', $pattern) . ')\b/i';

Now, $pattern is displaying this list of words. An excerpt, including a few words that I think might be giving me trouble, is:

  

MANGROVE\s+TREE|MANTLE|MARACYN|MARACYN\-2|MARBLED|MARGIN|MARGINAL|MARINE|MATROTROPHY|MATURE|MAXILLA|MAXILLARY|MEANDER|MEDIAL|MEDIAN|MELANIN|MELANOPHORE|MEMBRANE|MENISCUS|MENTAL|MENTAL\s+BARBEL|MERISTIC|MERISTICS|MERISTIC\s+CHARACTER|MESETHMOID|MESIAL|MESO\-|MESOCORACOID|META\-|METABOLISM|METAMORPHOSIS|METRONIDAZOLE|METHYLENE\s+BLUE|MICROBE|MICROPREDATOR|MICROPYLE|MICROSATELLITE|MIGRATE|MIGRATION|MILLILITRE\s+\(ML\)|MICRO\s+CRAB|MICROLITRE|MICROWORM|MILT|MIMESIS|MIMETIC|MIMIC|MIMICRY|ML|MODAL|MODE|MOLLUSC|MOLLUSK|MONO\-|MONOECIOUS|MONOGAMOUS|MONOGAMY|MONOPHYLETIC|MONOSPECIFIC|MONOTYPIC|MORPHOLOGY|MORPHOMETRIC|MORPHOMETRIC\s+CHARACTER|MORPHOMETRICS|MOTTLED|MOUTH\-BROODER|MOUTH\s+ROT|

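The problem I'm having is that the filter goes haywire and links every space and every word in $content.

My question is: which words in $pattern (per the pastebin / excerpt) are causing this? I suspect it has something to do with the ' in BAUDELOT'S\s+LIGAMENT, but I'm not sure how to correct it, since preg_quote doesn't seem to escape apostrophes?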


Edit: here is some additional code, to help determine whether the problem lies here rather than in preg_replace:

$text_nodes = $xpath->query('//text()[not(ancestor::a)]');

foreach($text_nodes as $original_node) {
    $text     = $original_node->nodeValue;
    $hitcount = preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

    if ($hitcount == 0) continue;

    $offset   = 0;
    $parent   = $original_node->parentNode;
    $refnode  = $original_node->nextSibling;

    $parent->removeChild($original_node);

    foreach ($matches[0] as $i => $match) {
        $term_txt = $match[0];
        $term_pos = $match[1];
        $term_norm = preg_replace('/\s+/', ' ', strtoupper($term_txt));

        // insert any text before the term instance
        $prefix = substr($text, $offset, $term_pos - $offset);
        $parent->insertBefore($document->createTextNode($prefix), $refnode);

        // insert the actual term instance as a link
        $link = $document->createElement("a", $term_txt);
        $link->setAttribute("href", $urls[$term_norm]);
        $link->setAttribute("rel", $rels[$term_norm]);
        $link->setAttribute("class", "link_glossary");
        $parent->insertBefore($link, $refnode);

        $offset = $term_pos + strlen($term_txt);

        if ($i == $hitcount - 1) {  // last match, append remaining text
            $suffix = substr($text, $offset);
            $parent->insertBefore($document->createTextNode($suffix), $refnode);
        }
    }
}

Thanks in advance,

2 answers:

Answer 0: (score: 1)

  

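but I'm not sure how to correct it, since preg_quote doesn't seem to escape apostrophes?

preg_quote does not need to escape apostrophes, because they are not special characters in a regular expression.

I don't see why this regex should match every space and every word that is not in the list.

One problem I do see, though, is that you surround the alternation with word boundaries \b. That causes trouble for terms that end in a non-word character, such as "MACRO\-" or "MESO\-|MESOCORACOID|META\-": after the dash, \b only succeeds if the next character is a word character. So such a term will match when a word character follows the dash directly, but not when a space or punctuation follows it. (I don't know the text you are matching against.)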
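To make the point about apostrophes concrete, here is a minimal sketch of what preg_quote returns for a few terms. The sample list is an assumption for illustration; only the BAUDELOT'S LIGAMENT and MESO- spellings are taken from the question. The second argument, '/', is added here so that the pattern delimiter is escaped as well.

```php
<?php
// Hypothetical sample terms (not the asker's full list).
$samples = array("BAUDELOT'S LIGAMENT", "MESO-", "MILLILITRE (ML)");

$quoted = array();
foreach ($samples as $sample) {
    // Escape regex metacharacters, plus the '/' delimiter.
    $quoted[$sample] = preg_quote($sample, '/');
}

// Print each raw term next to its quoted form.
foreach ($quoted as $raw => $escaped) {
    echo $raw, ' => ', $escaped, PHP_EOL;
}
```

The apostrophe comes back unchanged, while the hyphen and the parentheses gain a backslash, so the apostrophe on its own should not be what breaks the pattern.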
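A small sketch of the \b behaviour described above, using a single term from the excerpt. The two subject strings and the lookaround variant are assumptions added for illustration, not something taken from the question.

```php
<?php
// The asker's style of pattern, reduced to one term that ends in a dash.
$with_b = '/\b(MESO\-)\b/i';

// One possible alternative: require "no word character" on either side
// instead of a word boundary.
$with_lookaround = '/(?<!\w)(MESO\-)(?!\w)/i';

// Hypothetical subject strings.
$followed_by_space = 'the meso- prefix';
$followed_by_word  = 'a meso-pelagic species';

$results = array(
    'b_space'    => preg_match($with_b, $followed_by_space),
    'b_word'     => preg_match($with_b, $followed_by_word),
    'look_space' => preg_match($with_lookaround, $followed_by_space),
    'look_word'  => preg_match($with_lookaround, $followed_by_word),
);

print_r($results);
```

With \b, the term is found only when a word character follows the dash; with the lookarounds, it is found only when one does not. Which of the two is wanted depends on the text being linked, so this is a starting point rather than a drop-in fix.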

Answer 1: (score: 0)

$term_norm = preg_replace('/\s+/', ' ', strtoupper(trim($term)));

You need preg_quote($term) there ^
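Related to where the quoting happens: in the question's code the terms pass through preg_quote once when they are loaded from the database and again when the pattern is built. The sketch below, with a hypothetical subject string, shows what a second pass does to a term from the excerpt. It is an illustration of that one effect, not a diagnosis of the whole problem.

```php
<?php
$term = 'MESO-';

// Quoted once: the dash is escaped.
$once = preg_quote($term, '/');

// Quoted twice: the backslash added by the first pass is escaped too,
// so the pattern now expects a literal backslash in the text.
$twice = preg_quote($once, '/');

// Hypothetical subject string.
$subject = 'a meso-pelagic species';

$match_once  = preg_match('/' . $once . '/i', $subject);
$match_twice = preg_match('/' . $twice . '/i', $subject);

echo $once, ' matches: ', $match_once, PHP_EOL;
echo $twice, ' matches: ', $match_twice, PHP_EOL;
```

Quoting each term exactly once, at the point where the pattern is assembled, avoids this; the unquoted term can then also serve as the key for $urls and $rels.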