Question

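How can I search only the first 500 characters of a field, not counting HTML tags?

So far I have come up with the following, which searches the text for occurrences of a keyword: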

    SELECT *
    FROM root_pages

    WHERE root_pages.pg_cat_id = '2'
    AND root_pages.parent_id != root_pages.pg_id
    AND root_pages.pg_hide != '1'
    AND root_pages.pg_url != 'cms'
    AND (root_pages.pg_content_1 REGEXP '[[:<:]]".$search."[[:>:]]'
    OR root_pages.pg_content_2 REGEXP '[[:<:]]".$search."[[:>:]]')

    ORDER BY root_pages.pg_created DESC

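How can I add one more condition to it: only the first 500 letters, not counting HTML tags?

It would be perfect if it could search for the keyword in the first paragraph only. Is that possible?

Edit:

Thanks for the help, everyone! Here is my solution: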

    # query to search for “whole word match” in SQL only, e.g. when I search for "rid", it should not match "arid", but it should match "a rid".
    # you can use REGEXP and the [[:<:]] and [[:>:]] word-boundary markers:
    $sql = "
    SELECT *
    FROM root_pages

    WHERE root_pages.pg_cat_id = '2'
    AND root_pages.parent_id != root_pages.pg_id
    AND root_pages.pg_hide != '1'
    AND root_pages.pg_url != 'cms'
    AND (root_pages.pg_content_1 REGEXP '[[:<:]]".$search."[[:>:]]'
    OR root_pages.pg_content_2 REGEXP '[[:<:]]".$search."[[:>:]]')

    ORDER BY root_pages.pg_created DESC
    ";

    # use the instantiated db connection object from the init.php, to process the query
    $items = $connection -> fetch_all($sql);
    $total_item = $connection -> num_rows($sql);

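    # start with no matches so the count below is defined even when the query returns no rows
    $match = array();
    $total_match = 0;

    if ($total_item > 0)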
    {
        foreach($items as $item)
        {
            # get the content
            if(empty($item['pg_content_2'])) $pg_content = strip_tags($item['pg_content_1']);
                else $pg_content = strip_tags($item['pg_content_2']);

            # get the first 500 letters only
            $pg_content = substr($pg_content, 0, 500);

            # get the matches; preg_quote() keeps regex metacharacters in $search literal
            if (preg_match("/\b(".preg_quote($search, '/').")\b/", $pg_content)) 
            {
                $match[] = $pg_content;
            }

        }

        $total_match = count($match);
        //echo $count;
    }

    if($total_match > 0)
    {
        echo '<result message="'.$total_match.' matches found! Please wait while redirecting." search="'.$search.'"/>';
    }
    else
    {
        echo '<error elementid="input" message="Sorry no results are found."/>';
    }

Answer 1

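It's not as simple as stripping or skipping tags: you will often find that the first 500 characters sit inside a <style>, <script>, or <head> element.

Also, merely removing the tags glues adjacent words together. For example, this becomes "separatewords":

    separate<br>words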

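If you want to do it properly, I suggest using an XSLT stylesheet in text output mode that converts the HTML to plain text by adding whitespace around block-level elements and removing scripts, <head>, and so on.

A simpler approach, one that kills kittens, would be to preprocess the HTML with a series of regular expressions instead of XSLT.

Once you have converted the HTML to usable text, store that text in an extra column in the database and search against that. You can even add a FULLTEXT index on it.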
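
The regex preprocessing mentioned above could look roughly like the sketch below. The function name `html_to_search_text` and the list of block-level tags are assumptions made for illustration, not part of the original answer, and regexes will still get unusual or malformed HTML wrong.

```php
<?php
// Hypothetical helper: rough, regex-based HTML-to-text conversion for search.
function html_to_search_text($html)
{
    // Drop elements whose content is never visible text.
    $text = preg_replace('#<(head|script|style)\b[^>]*>.*?</\1\s*>#is', ' ', $html);

    // Drop HTML comments.
    $text = preg_replace('/<!--.*?-->/s', ' ', $text);

    // Turn <br> and block-level tag boundaries into spaces so that
    // neighbouring words are not glued together.
    $text = preg_replace(
        '#</?(br|p|div|h[1-6]|li|ul|ol|tr|td|th|table|blockquote|pre|hr)\b[^>]*>#i',
        ' ',
        $text
    );

    // Remove the remaining (inline) tags and decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');

    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

With this in place, `substr(html_to_search_text($item['pg_content_1']), 0, 500)` would give the first 500 bytes of visible text rather than of markup.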

Answer 2

If paragraphs are defined with p elements:

    ... REGEXP '<p[^>]*>".$search."</p>'

Don't forget to escape regex special characters in $search.
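
As a minimal sketch of that escaping step: the helper names `quote_for_regexp` and `whole_word_pattern` below are hypothetical, and the set of escaped characters is an assumption about what the server's regex engine treats as special. Note that the `[[:<:]]` and `[[:>:]]` markers are only understood by MySQL versions before 8.0.4; later versions use the ICU engine, where `\b` is the word boundary.

```php
<?php
// Hypothetical helper: backslash-escape regex metacharacters in a search term.
function quote_for_regexp($term)
{
    return preg_replace('/[.\\\\+*?\[\]^$(){}|]/', '\\\\$0', $term);
}

// Hypothetical helper: build a whole-word pattern for MySQL REGEXP.
// Pass $icu = true for MySQL 8.0.4 and later.
function whole_word_pattern($term, $icu = false)
{
    $quoted = quote_for_regexp($term);
    return $icu
        ? '\\b' . $quoted . '\\b'
        : '[[:<:]]' . $quoted . '[[:>:]]';
}

// Usage sketch, assuming an existing PDO connection in $pdo. Binding the
// pattern as a parameter avoids having to escape it again for the SQL
// string literal:
// $stmt = $pdo->prepare('SELECT * FROM root_pages WHERE pg_content_1 REGEXP ?');
// $stmt->execute(array(whole_word_pattern($search)));
```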

Answer 3

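Regarding:

How can I add one more condition to it: only the first 500 letters, not counting HTML tags?

You can't do that with MySQL alone (at least not with a solution that works 100% of the time). For more details, see Parsing Html The Cthulhu Way and this SO answer.

PHP's strip_tags and substr will help you achieve your goal.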

Answer 4

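If you really want to be able to do this from MySQL, I think the best approach (in my opinion) is to keep duplicate fields that hold plain-text versions of pg_content_1 (and pg_content_2).

This adds storage and memory overhead, but it speeds up processing during searches. If you have an ORM library, you can hook an event to onSave and make sure the plain-text fields are updated automatically.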
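
A save hook along those lines might be sketched as follows. The column names `pg_text_1` and `pg_text_2` and the function `with_plain_text_fields` are hypothetical; how the function gets wired into onSave depends on the ORM in use.

```php
<?php
// Hypothetical onSave step: fill the plain-text copies from the HTML fields
// before the row is written to the database.
function with_plain_text_fields(array $page)
{
    $fields = array('pg_content_1' => 'pg_text_1', 'pg_content_2' => 'pg_text_2');

    foreach ($fields as $htmlField => $textField) {
        $html = isset($page[$htmlField]) ? (string) $page[$htmlField] : '';

        // Put a space before every tag so stripping does not glue words
        // together. Trade-off: a word split by an inline tag gets a space too.
        $plain = strip_tags(str_replace('<', ' <', $html));
        $plain = html_entity_decode($plain, ENT_QUOTES, 'UTF-8');

        // Collapse whitespace so character counts reflect visible text.
        $page[$textField] = trim(preg_replace('/\s+/', ' ', $plain));
    }

    return $page;
}
```

The search query could then run against the plain-text columns, for example with `LEFT(pg_text_1, 500)` on the MySQL side.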

Search the first 500 letters and exclude HTML tags?
