Question

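I host the Catholic Encyclopedia on my website. It has over 11,000 articles.

I would like to replace words and phrases in the articles on my site with links to the relevant entries in the Catholic Encyclopedia. So, if someone writes:

St. Peter was the first Pope.

it should replace "St. Peter" with a link to the St. Peter article, and "Pope" with a link to the Pope article.

I have it working, but it is slow. There are over 30,000 possible replacements, so optimization matters a lot. I'm just not sure where to start.

Here is my existing code. Note that it uses Drupal. Also, it replaces the words with a [cathenlink] tag, and that tag is replaced by the actual HTML link later in the code.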

function ce_execute_filter($text)
{

    // If text is empty, return as-is
    if (!$text) {
        return $text;
    }

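    // Split by paragraph. The capture group is required so that
    // PREG_SPLIT_DELIM_CAPTURE also returns the newline runs.
    $lines = preg_split('/(\n+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);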

    // Contains the parsed and linked text
    $linked_text = '';

    foreach ($lines as $line)
    {

        // If this fragment is only one or more newline characters,
        // Add it to $linked_text and continue without parsing
        if (preg_match('/^\n+$/', $line)) {
            $linked_text .= $line;
            continue;
        }

        // Select any terms that might be in this line
        // Ordered by descending length of term,
        // so that the longest terms get replaced first
        $result = db_query('SELECT title, term FROM {catholic_encyclopedia_terms} ' .
                "WHERE :text LIKE CONCAT('%', CONCAT(term, '%')) " .
                'GROUP BY term ' .
                'ORDER BY char_length(term) DESC',
                array(
                    ':text' => $line
                    ))
            ->fetchAll();

        // Array with lowercase term as key, title of entry as value
        $terms = array();

        // Array of the terms only in descending order of length
        $ordered_terms = array();

        foreach ($result as $r)
        {
            $terms[strtolower($r->term)] = $r->title;
            // Escape the term, including the "/" regex delimiter
            $ordered_terms[] = preg_quote($r->term, '/');
        }

        // If no terms were returned, add the line and continue without parsing.
        if (empty($ordered_terms)) {
            $linked_text .= $line;
            continue;
        }

        // Do the replace
        // Get the regexp by joining $ordered_terms with |
        $line = preg_replace_callback(
            '/\b(' . implode('|', $ordered_terms) . ')\b/i',
            function ($matches) use ($terms) {
                // Leave the text unchanged if nothing was captured
                if ($matches[1] === '') {
                    return $matches[0];
                }
                return '[cathenlink=' .
                    $terms[strtolower($matches[1])] . ']' .
                    $matches[1] . '[/cathenlink]';
            },
            $line
        );

        $linked_text .= $line;
    }

    return $linked_text;
}

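I am doing the preg_replace this way so that it does not replace a word twice. I would use strtr, but then there is no way to make sure it matches a whole word rather than just part of one.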
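To illustrate the whole-word concern, here is a minimal sketch comparing the two approaches on a made-up sentence. The sample text and the bracketed replacement are placeholders for illustration only, not the real [cathenlink] output.

```php
<?php
$text = 'The Popes met the Pope.';

// strtr replaces every occurrence, including the one inside "Popes".
$plain = strtr($text, ['Pope' => '[Pope]']);
// $plain is 'The [Pope]s met the [Pope].'

// \b anchors the match to word boundaries, so "Popes" is left alone.
$bounded = preg_replace('/\bPope\b/', '[Pope]', $text);
// $bounded is 'The Popes met the [Pope].'
```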

Is there any way to make this faster? Right now it is very slow.

Answer 1

I think the LIKE keyword is what is slowing you down. Is the column indexed?

You may find some pointers here
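One way to avoid the per-line LIKE query altogether might be to load all terms once and match in PHP. In the question's query the column sits on the right-hand side of LIKE and is wrapped in wildcards, so every row has to be compared for every line. The sketch below is only an illustration under assumptions: `link_terms`, `$terms` (lowercase term => article title) and `$maxWords` are hypothetical names, and it assumes terms are separated in the text exactly as they are stored.

```php
<?php
// Hypothetical helper: $terms maps lowercase term => article title and is
// assumed to be loaded once (e.g. with a single SELECT), not once per line.
function link_terms(string $text, array $terms, int $maxWords = 4): string
{
    // Split into words and separators; even indices are words,
    // odd indices are the captured non-word separators.
    $tokens = preg_split('/(\W+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $n = count($tokens);
    $out = '';
    $i = 0;

    while ($i < $n) {
        // Copy separators (and empty edge tokens) through unchanged.
        if ($i % 2 === 1 || $tokens[$i] === '') {
            $out .= $tokens[$i];
            $i++;
            continue;
        }

        $matched = false;
        // Try the longest phrase first: $k words span 2 * $k - 1 tokens.
        for ($k = $maxWords; $k >= 1; $k--) {
            $len = 2 * $k - 1;
            if ($i + $len > $n) {
                continue;
            }
            $phrase = implode('', array_slice($tokens, $i, $len));
            $key = strtolower($phrase);
            if (isset($terms[$key])) {
                $out .= '[cathenlink=' . $terms[$key] . ']' . $phrase . '[/cathenlink]';
                $i += $len;
                $matched = true;
                break;
            }
        }

        if (!$matched) {
            $out .= $tokens[$i];
            $i++;
        }
    }

    return $out;
}

$terms = ['st. peter' => 'St. Peter', 'pope' => 'Pope'];
echo link_terms('St. Peter was the first Pope.', $terms), "\n";
```

Each lookup is a hash-table check, so the cost depends on the length of the text and on `$maxWords` rather than on the number of terms. Because whole tokens are compared, "Popes" would not match the term "Pope".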

Answer 2

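You could use an indexing system such as Lucene to index the Catholic Encyclopedia. I doubt it changes often, so the index could be updated on a daily basis. Lucene is written in Java, but I know Zend has a PHP module that can read the index.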

Answer 3

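OK, I think the way I am doing it is probably about as efficient as it gets. What I came up with is caching the results for a week, so that a post does not have to be parsed more than once a week. After implementing this I saw a noticeable speed improvement on my site, so it seems to be working.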
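The caching idea could look roughly like the sketch below. This is not the code actually used on the site. `cached_filter`, `ONE_WEEK` and the in-memory `$cache` array are hypothetical stand-ins, and on a real Drupal site the array would presumably be replaced by a persistent cache.

```php
<?php
// Hypothetical constant: one week in seconds (7 * 24 * 60 * 60).
const ONE_WEEK = 604800;

// Hypothetical wrapper: runs $filter only when no unexpired result is stored.
// $cache stands in for a persistent store; $now is passed in to ease testing.
function cached_filter(string $text, callable $filter, array &$cache, int $now, int $ttl = ONE_WEEK): string
{
    $key = md5($text);

    // Reuse the stored output while it has not expired.
    if (isset($cache[$key]) && $cache[$key]['expires'] > $now) {
        return $cache[$key]['output'];
    }

    // Otherwise do the expensive parse and remember the result.
    $output = $filter($text);
    $cache[$key] = ['output' => $output, 'expires' => $now + $ttl];

    return $output;
}
```

Keying on a hash of the text means an edited post is parsed again immediately, while an unchanged post is parsed at most once per week.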

Replacing words with links