How to sort titles alphabetically (ignoring "The", "An", etc.) and use an index

Time: 2013-05-06 14:38:29

Tags: postgresql

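I have a PostgreSQL table with a title field, but the titles often start with "The" or "An". I need a way to sort these records alphabetically the way a library does, ignoring those articles when sorting.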

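Two questions:

  1. What is the best way to write this ORDER BY expression in SQL?

  2. How do I build and use an appropriate index on the title field, without copying a substring of the title value into something like an "alphabetical_title" field and indexing that?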

I am looking for a solution tailored to PostgreSQL. Thanks.

2 answers:

Answer 0 (score: 3)

You can add an index on the expression:

create index on yourtable (natural_sort(title));

Postgres will then use the index when appropriate, and will not actually compute natural_sort(title) unless you also select it.
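As a minimal sketch of the ORDER BY side, assume a hypothetical helper strip_articles() (not part of this answer) that lower-cases the title and removes one leading English article. The table name yourtable comes from the statement above; the function and index names are made up for illustration.

```sql
-- Hypothetical helper: lower-case the title and drop a single leading
-- "the", "an" or "a" followed by whitespace. Everything else is kept.
CREATE OR REPLACE FUNCTION strip_articles(text)
    RETURNS text
AS $$
    SELECT regexp_replace(lower($1), '^(the|an|a)\s+', '');
$$ LANGUAGE sql IMMUTABLE STRICT;

-- Expression index; the function must be IMMUTABLE to be indexable.
CREATE INDEX yourtable_strip_articles_idx
    ON yourtable (strip_articles(title));

-- The ORDER BY repeats the indexed expression exactly.
SELECT title
FROM   yourtable
ORDER  BY strip_articles(title)
LIMIT  20;
```

The planner can only consider the index when the expression in the query matches the indexed expression, so it is worth checking the plan with EXPLAIN on your own data.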

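That being said (and much like with tsvector fields), you will get better performance if you actually store the precomputed result. If, in the case above, Postgres decides not to use the index for whatever reason, it has to compute the expression for every row it considers, which will weigh heavily on your query.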
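One way to store the precomputed value, sketched here under the assumption that you run PostgreSQL 12 or later (generated columns do not exist in older versions) and that natural_sort(text) is the IMMUTABLE function used in the index statement above. The column name title_sort and the index name are hypothetical.

```sql
-- Assumes PostgreSQL 12+ and an IMMUTABLE natural_sort(text).
-- The column is recomputed automatically whenever title changes.
ALTER TABLE yourtable
    ADD COLUMN title_sort text
    GENERATED ALWAYS AS (natural_sort(title)) STORED;

-- A plain index on the stored value.
CREATE INDEX yourtable_title_sort_idx ON yourtable (title_sort);

-- Sorting reads the stored value, so the function is not called per row
-- even when the planner chooses a sequential scan plus sort.
SELECT title
FROM   yourtable
ORDER  BY title_sort;
```

On older versions, a BEFORE INSERT OR UPDATE trigger that fills an ordinary column should achieve the same effect.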

Either way, don't forget about numbers:

http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.html
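To illustrate the problem with numbers, here is a small self-contained query over made-up titles. Text is compared character by character, so with the usual collations a title containing "10" is placed ahead of one containing "2".

```sql
-- Made-up sample data; no table is required.
-- Plain text ordering compares '1' with '2' at the first differing
-- character, so 'Episode 10' is expected to sort before 'Episode 2'.
SELECT title
FROM   (VALUES ('Episode 10'), ('Episode 2'), ('Episode 1')) AS t(title)
ORDER  BY title;
```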


Here are two functions to get you started with natural sorting:

/**
 * @param text _str The input string.
 * @return text The output string for consumption in natural sorting.
 */
CREATE OR REPLACE FUNCTION natsort(text)
    RETURNS text
AS $$
DECLARE
    _str    text := $1;
    _pad    int := 15; -- Maximum precision for PostgreSQL floats
BEGIN
    -- Bail if the string is empty
    IF  trim(_str) = ''
    THEN
        RETURN '';
    END IF;

    -- Strip accents and lower the case
    _str := lower(unaccent(_str));

    -- Replace nonsensical characters
    _str := regexp_replace(_str, E'[^a-z0-9$¢£¥₤€@&%\\(\\)\\[\\]\\{\\}_:;,\\.\\?!\\+\\-]+', ' ', 'g');

    -- Trim the result
    _str := trim(_str);

    -- @todo we'd ideally want to strip leading articles/prepositions ('a', 'the') at this stage,
    --       but to_tsvector()'s default dictionary also strips stop words (e.g. 'all').

    -- We're done if the string contains no numbers
    IF  _str !~ '[0-9]'
    THEN
        RETURN _str;
    END IF;

    -- Force spaces between numbers, so we can use regexp_split_to_table()
    _str := regexp_replace(_str, E'((?:[0-9]+|[0-9]*\\.[0-9]+)(?:e[+-]?[0-9]+\\M)?)', E' \\1 ', 'g');

    -- Pad zeros to obtain a reasonably natural looking sort order
    RETURN array_to_string(ARRAY(
    SELECT  CASE
            WHEN val !~ E'^\\.?[0-9]'
            -- Not a number; return as is
            THEN val
            -- Do our best after expanding the number...
            ELSE COALESCE(lpad(substring(val::numeric::text from '^[0-9]+'), _pad, '0'), '') ||
                COALESCE(rpad(substring(val::numeric::text from E'\\.[0-9]+'), _pad, '0'), '')
            END
    FROM    regexp_split_to_table(_str, E'\\s+') as val
    WHERE   val <> ''
    ), ' ');
END;
$$ IMMUTABLE STRICT LANGUAGE plpgsql COST 1;

COMMENT ON FUNCTION natsort(text) IS
'Rewrites a string so it can be used in natural sorting.

It''s by no means bullet proof, but it works properly for positive integers,
reasonably well for positive floats, and it''s fast enough to be used in a
trigger that populates an indexed column, or in an index directly.';

/**
 * @param text[] _values The potential values to use.
 * @return text The output string for consumption in natural sorting.
 */
CREATE OR REPLACE FUNCTION sort(text[])
    RETURNS text
AS $$
DECLARE
    _values     alias for $1;
    _sort       text;
BEGIN
    SELECT  natsort(value)
    INTO    _sort
    FROM    unnest(_values) as value
    WHERE   value IS NOT NULL
    AND     value <> ''
    AND     natsort(value) <> ''
    LIMIT 1;

    RETURN COALESCE(_sort, '');
END;
$$ IMMUTABLE STRICT LANGUAGE plpgsql COST 1;

COMMENT ON FUNCTION sort(text[]) IS
'Returns natsort() of the first significant input argument.';
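A short usage sketch for the two functions above. It assumes the unaccent extension can be installed in your database, since natsort() calls unaccent(), and it reuses the yourtable name from the index statement earlier in this answer.

```sql
-- natsort() depends on unaccent(), which ships as a contrib extension.
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Order rows by the rewritten, zero-padded form of the title.
SELECT title
FROM   yourtable
ORDER  BY natsort(title);

-- sort() takes the first array element that yields a non-empty natsort()
-- result; NULL and empty strings are skipped.
SELECT sort(ARRAY[NULL, '', 'Chapter 2']::text[]);
```

Note that natsort() is declared IMMUTABLE although unaccent() itself is only STABLE; that declaration is what allows it inside an index, so a change to the unaccent rules would call for a REINDEX.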

Sample output from the unit tests for the first function:

public function testNatsort()
{
    $this->checkInOut('natsort', array(
        '<NULL>'                => null,
        ''                      => '',
        'ABCde'                 => 'abcde',
        '12345 12345'           => '000000000012345 000000000012345',
        '12345.12345'           => '000000000012345.123450000000000',
        '12345e5'               => '000001234500000',
        '.12345e5'              => '000000000012345',
        '1e10'                  => '000010000000000',
        '1.2e20'                => '120000000000000',
        '-12345e5'              => '- 000001234500000',
        '-.12345e5'             => '- 000000000012345',
        '-1e10'                 => '- 000010000000000',
        '-1.2e20'               => '- 120000000000000',
        '+-$¢£¥₤€@&%'           => '+-$¢£¥₤€@&%',
        'ÀÁÂÃÄÅĀĄĂÆ'            => 'aaaaaeaaaaaae',
        'ÈÉÊËĒĘĚĔĖÐ'            => 'eeeeeeeeee',
        'ÌÍÎÏĪĨĬĮİIJ'            => 'iiiiiiiiiij',
        'ÒÓÔÕÖØŌŐŎŒ'            => 'oooooeoooooe',
        'ÙÚÛÜŪŮŰŬŨŲ'            => 'uuuueuuuuuu',
        'ÝŶŸ'                   => 'yyy',
        'àáâãäåāąăæ'            => 'aaaaaeaaaaaae',
        'èéêëēęěĕėð'            => 'eeeeeeeeee',
        'ìíîïīĩĭįıij'            => 'iiiiiiiiiij',
        'òóôõöøōőŏœ'            => 'oooooeoooooe',
        'ùúûüūůűŭũų'            => 'uuuueuuuuuu',
        'ýÿŷ'                   => 'yyy',
        'ÇĆČĈĊ'                 => 'ccccc',
        'ĎĐ'                    => 'dd',
        'Ƒ'                     => 'f',
        'ĜĞĠĢ'                  => 'gggg',
        'ĤĦ'                    => 'hh',
        'Ĵ'                     => 'j',
        'Ķ'                     => 'k',
        'ŁĽĹĻĿ'                 => 'lllll',
        'ÑŃŇŅŊ'                 => 'nnnnn',
        'ŔŘŖ'                   => 'rrr',
        'ŚŠŞŜȘſ'                => 'sssssss',
        'ŤŢŦȚÞ'                 => 'ttttt',
        'Ŵ'                     => 'w',
        'ŹŽŻ'                   => 'zzz',
        'çćčĉċ'                 => 'ccccc',
        'ďđ'                    => 'dd',
        'ƒ'                     => 'f',
        'ĝğġģ'                  => 'gggg',
        'ĥħ'                    => 'hh',
        'ĵ'                     => 'j',
        'ĸķ'                    => 'kk',
        'łľĺļŀ'                 => 'lllll',
        'ñńňņʼnŋ'                => 'nnnnnn',
        'ŕřŗ'                   => 'rrr',
        'śšşŝșß'                => 'sssssss',
        'ťţŧțþ'                 => 'ttttt',
        'ŵ'                     => 'w',
        'žżź'                   => 'zzz',
        '-_aaa--zzz--'          => '-_aaa--zzz--',
        '-:àáâ;-žżź--'          => '-:aaa;-zzz--',
        '-.à$â,-ž%ź--'          => '-.a$a,-z%z--',
        '--à$â--ž%ź--'          => '--a$a--z%z--',
        '-$à(â--ž)ź%-'          => '-$a(a--z)z%-',
        '#-à$â--ž?!ź-'          => '-a$a--z?!z-',
    ));
}

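Answer 1 (score: 0)

  1. You can use the various string functions in PostgreSQL, but you may be better off using a text index; see http://www.postgresql.org/docs/9.2/static/textsearch.html

  2. As Denis mentioned, you can index an expression in PostgreSQL, so you can index the very same expression you are searching on.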
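A minimal sketch of what indexing the searched expression looks like with full text search, assuming the built-in 'english' configuration and the yourtable name used in the other answer. The index name is hypothetical. This helps with searching titles while ignoring stop words such as "the"; it does not by itself give a sort order.

```sql
-- GIN index on the same expression the query uses.
CREATE INDEX yourtable_title_fts_idx
    ON yourtable USING gin (to_tsvector('english', title));

-- The WHERE clause repeats to_tsvector('english', title) exactly,
-- so the planner can match it to the index.
SELECT title
FROM   yourtable
WHERE  to_tsvector('english', title) @@ to_tsquery('english', 'lord & ring');
```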