Question

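I am storing dictionary entries in a Lua table, using it as an array. I would like to sort the entries from within Lua, so that I can add new ones without having to move them to the correct place myself (which becomes tedious very quickly). However, I am facing several problems:

Many words contain non-ASCII characters, which makes the built-in comparison operator for strings unsuitable for the task (e.g., it makes amputar come before ámbito).
There are words in a variety of languages (all of them Western, though), namely Spanish, German and English. The problem here is that different languages may have different notions of alphabetical order. Since the main language is Spanish, I would like to use its rules, although I am not sure whether that works for characters that are not part of the Spanish alphabet.
Some words contain upper-case letters or, even worse, start with them. For example, all German nouns start with an upper-case letter. With the built-in comparison operator, upper-case letters come before their lower-case siblings, which is not the behaviour I want; I would like upper-case letters to be treated as if they were lower-case.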

For example, given the following table:

local entries =
{
    'amputar',
    'Volksgeist',
    'ámbito'
}

These entries should be ordered as follows:

ámbito
amputar
Volksgeist

However, with my current code, the output is wrong:

local function compare_utf8_strings( o1 , o2 )
    -- Using the built-in non-UTF-8-aware non-locale-aware string comparison operator
    return o1 < o2
end

table.sort( entries , function ( a , b ) return compare_utf8_strings( a , b ) end )

for i, entry in ipairs(entries) do
    print( entry )
end

Output:

Volksgeist
amputar
ámbito
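
The order above can be traced back to byte values. The following is a small sketch, assuming the strings are UTF-8 encoded with precomposed accents and that Lua runs in its default "C" locale, so that `<` effectively compares bytes. The helper name `first_byte` is only for illustration.

```lua
-- Assumption: UTF-8 encoding; 'ámbito' is spelled with explicit bytes so the
-- result does not depend on how this file is saved.
local ambito = '\195\161mbito'   -- 'ámbito': 'á' is the two-byte sequence C3 A1

-- Hypothetical helper: numeric value of the first byte of a string
local function first_byte( s )
    return string.byte( s , 1 )
end

print( first_byte( 'Volksgeist' ) )  -- 86, 'V' (upper case sorts first)
print( first_byte( 'amputar' ) )     -- 97, 'a'
print( first_byte( ambito ) )        -- 195, lead byte of 'á' (sorts last)

-- The byte length also differs from the character count
print( #ambito )                     -- 7 bytes for 6 characters
```

Upper-case ASCII letters have smaller byte values than lower-case ones, and every non-ASCII character starts with a byte above 127, which is why Volksgeist comes first and ámbito last.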

Could you please take the following code and hack it so that it meets my requirements?

local entries =
{
    'amputar',
    'Volksgeist',
    'ámbito'
}

local function compare_utf8_strings( o1 , o2 )
    -- Hack here, please, accomplishing my requirements
end

table.sort( entries , function ( a , b ) return compare_utf8_strings( a , b ) end )

for i, entry in ipairs(entries) do
    print( entry )
end

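It should output:

ámbito
amputar
Volksgeist

As an additional requirement, all of this Lua code lives inside LuaTeX, which currently supports version 5.2 of the language. As for external libraries, I suppose they can be used.

I am new to the Lua camp, so please excuse any mistakes I have made, and feel free to point them out so that I can fix them.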

Answer 1

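After searching for a while to no avail, I found an article by Joseph Wright. Although it touches on my problem, it does not offer a clear solution. I asked him, and it turns out that there is currently no direct way to do what I want. He did point out, however, that slnunicode comes built into LuaTeX (although it will be replaced in the future).

I developed a "crude" solution using the facilities available in the LuaTeX environment. It is not elegant, but it works, and it does not pull in any external dependencies. As for its efficiency, I have not noticed any difference in document build times.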

-- Make the facilities available
unicode = require( 'unicode' )
utf8 = unicode.utf8

--[[
    Each character's position in this array-like table determines its 'priority'.
    Several characters in the same slot have the same 'priority'.
]]
local alphabet =
{
    -- The space is here because of other requirements of my project
    { ' ' },
    { 'a', 'á', 'à', 'ä' },
    { 'b' },
    { 'c' },
    { 'd' },
    { 'e', 'é', 'è', 'ë' },
    { 'f' },
    { 'g' },
    { 'h' },
    { 'i', 'í', 'ì', 'ï' },
    { 'j' },
    { 'k' },
    { 'l' },
    { 'm' },
    { 'n' },
    { 'ñ' },
    { 'o', 'ó', 'ò', 'ö' },
    { 'p' },
    { 'q' },
    { 'r' },
    { 's' },
    { 't' },
    { 'u', 'ú', 'ù', 'ü' },
    { 'v' },
    { 'w' },
    { 'x' },
    { 'y' },
    { 'z' }
}

-- Looks up the character `character´ in the alphabet and returns its 'priority'
local function get_pos_in_alphabet( character )
    for i, alphabet_entry in ipairs(alphabet) do
        for _, alphabet_char in ipairs(alphabet_entry) do
            if character == alphabet_char then
                return i
            end
        end
    end

    --[[
        If it isn't in the alphabet, abort: it's better than silently outputting some
        random garbage, and, thanks to the message, allows to add the character to
        the table.
    ]]
    assert( false , "'" .. character .. "' was not in alphabet" )
end

-- Returns the characters in the UTF-8-encoded string `s´ in an array-like table
local function get_utf8_string_characters( s )
    --[[
        I saw this variable being used in several code snippets around the Web, but
        it isn't provided in my LuaTeX environment; I use this form of initialization
        to be safe if it's defined in the future.
    ]]
    utf8.charpattern = utf8.charpattern or "([%z\1-\127\194-\244][\128-\191]*)"

    local characters = {}

    for character in s:gmatch(utf8.charpattern) do
        table.insert( characters , character )
    end

    return characters
end

local function compare_utf8_strings( _o1 , _o2 )
    --[[
        `o1_chars´ and `o2_chars´ are array-like tables containing all of the
        characters of each string, which are all made lower-case using the
        slnunicode facilities that come built-in with LuaTeX.
    ]]
    local o1_chars = get_utf8_string_characters( utf8.lower(_o1) )
    local o2_chars = get_utf8_string_characters( utf8.lower(_o2) )

    -- Count characters rather than bytes: each table holds one entry per character
    local o1_len = #o1_chars
    local o2_len = #o2_chars

    for i = 1, math.min( o1_len , o2_len ) do
        local o1_pos = get_pos_in_alphabet( o1_chars[i] )
        local o2_pos = get_pos_in_alphabet( o2_chars[i] )

        if o1_pos > o2_pos then
            return false
        elseif o1_pos < o2_pos then
            return true
        end
    end

    return o1_len < o2_len
end

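I was not able to integrate this solution into the framework from the question, because my test environment, the ZeroBrane Studio Lua IDE, does not ship with slnunicode, and I do not know how to add it.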
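
For anyone in the same situation, here is a minimal sketch of the same slot-based idea that avoids slnunicode, so it should run in a stock Lua interpreter. It is not the code used in the project: the `lower_map` table and the `to_priorities` helper are assumptions added for illustration, lower-casing of accented capitals is done by hand, and characters outside the table (for example 'ß') still raise an error. It assumes UTF-8 strings with precomposed accents.

```lua
-- Sketch only: same ordering idea as above, without slnunicode.
-- Hypothetical helper table: lower-case forms of the accented capitals covered here.
-- string.lower is only relied on for ASCII letters.
local lower_map = {
    ['Á'] = 'á', ['À'] = 'à', ['Ä'] = 'ä',
    ['É'] = 'é', ['È'] = 'è', ['Ë'] = 'ë',
    ['Í'] = 'í', ['Ì'] = 'ì', ['Ï'] = 'ï',
    ['Ó'] = 'ó', ['Ò'] = 'ò', ['Ö'] = 'ö',
    ['Ú'] = 'ú', ['Ù'] = 'ù', ['Ü'] = 'ü',
    ['Ñ'] = 'ñ',
}

-- One slot per priority; characters in the same slot compare as equal.
local alphabet = {
    { ' ' },
    { 'a', 'á', 'à', 'ä' }, { 'b' }, { 'c' }, { 'd' },
    { 'e', 'é', 'è', 'ë' }, { 'f' }, { 'g' }, { 'h' },
    { 'i', 'í', 'ì', 'ï' }, { 'j' }, { 'k' }, { 'l' }, { 'm' },
    { 'n' }, { 'ñ' },
    { 'o', 'ó', 'ò', 'ö' }, { 'p' }, { 'q' }, { 'r' }, { 's' }, { 't' },
    { 'u', 'ú', 'ù', 'ü' }, { 'v' }, { 'w' }, { 'x' }, { 'y' }, { 'z' },
}

-- Reverse lookup built once: character -> priority
local priority = {}
for i, slot in ipairs( alphabet ) do
    for _, ch in ipairs( slot ) do
        priority[ch] = i
    end
end

-- One UTF-8 character: a lead byte followed by its continuation bytes
-- (embedded NUL bytes are not handled)
local charpattern = "[\1-\127\194-\244][\128-\191]*"

-- Converts a string into an array of priorities, one per character
local function to_priorities( s )
    -- Lower-case ASCII first, then map accented capitals through lower_map;
    -- characters without an entry in lower_map are left unchanged
    local lowered = s:lower():gsub( "[\194-\244][\128-\191]*" , lower_map )
    local result = {}
    for ch in lowered:gmatch( charpattern ) do
        local p = priority[ch]
        if not p then
            error( "'" .. ch .. "' was not in alphabet" )
        end
        result[#result + 1] = p
    end
    return result
end

local function compare_utf8_strings( o1 , o2 )
    local p1, p2 = to_priorities( o1 ), to_priorities( o2 )
    for i = 1, math.min( #p1 , #p2 ) do
        if p1[i] ~= p2[i] then
            return p1[i] < p2[i]
        end
    end
    -- All shared positions are equal: the shorter string sorts first
    return #p1 < #p2
end

local entries = { 'amputar', 'Volksgeist', 'ámbito' }
table.sort( entries , compare_utf8_strings )

for _, entry in ipairs( entries ) do
    print( entry )
end
```

With the three entries from the question this prints ámbito, amputar, Volksgeist, in that order. Building the `priority` lookup once avoids scanning the whole alphabet for every character, which may matter for larger tables.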

That's it. If you have any doubts or would like further explanation, please use the comments. I hope it is useful to others.

Sorting a table containing UTF-8 encoded values alphabetically
