Question

My std::strings are encoded in UTF-8, so the std::string operator< doesn't cut it. How can I compare two UTF-8 encoded std::strings?

Where it doesn't cut it is with accents: é sorts after z, which it should not.

Thanks

Answer 1

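The standard has std::locale for locale-specific things such as collation (sorting). If the environment contains LC_COLLATE=en_US.utf8 or similar, this program will sort lines as desired.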

#include <algorithm>
#include <functional>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>
class collate_in {
  protected:
    const std::collate<char> &coll;
  public:
    collate_in(std::locale loc)
        : coll(std::use_facet<std::collate<char> >(loc)) {}
    bool operator()(const std::string &a, const std::string &b) const {
        // std::collate::compare() takes C-style string (begin, end)s and
        // returns values like strcmp or strcoll.  Compare to 0 for results
        // expected for a less<>-style comparator.
        return coll.compare(a.c_str(), a.c_str() + a.size(),
                            b.c_str(), b.c_str() + b.size()) < 0;
    }
};
int main() {
    std::vector<std::string> v;
    copy(std::istream_iterator<std::string>(std::cin),
         std::istream_iterator<std::string>(), back_inserter(v));
    // std::locale("") is the locale from the environment.  One could also
    // std::locale::global(std::locale("")) to set up this program's global
    // first, and then use locale() to get the global locale, or choose a
    // specific locale instead of using the environment's.
    sort(v.begin(), v.end(), collate_in(std::locale("")));
    copy(v.begin(), v.end(),
         std::ostream_iterator<std::string>(std::cout, "\n"));
    return 0;
}

$ cat >file
f
é
e
d
^D
$ LC_COLLATE=C ./a.out <file
d
e
f
é
$ LC_COLLATE=en_US.utf8 ./a.out <file
d
e
é
f

It was brought to my attention that std::locale::operator()(a, b) exists, which avoids the std::collate<>::compare(a, b) < 0 wrapper I wrote above.

#include <algorithm>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>
int main() {
    std::vector<std::string> v;
    copy(std::istream_iterator<std::string>(std::cin),
         std::istream_iterator<std::string>(), back_inserter(v));
    sort(v.begin(), v.end(), std::locale(""));
    copy(v.begin(), v.end(),
         std::ostream_iterator<std::string>(std::cout, "\n"));
    return 0;
}
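
The comment in the first program mentions choosing a specific locale instead of the environment's. Here is a minimal sketch of that, assuming a locale name such as en_US.UTF-8 that may or may not be installed on a given system. The helper names locale_or_classic and sort_with_locale are made up for illustration. The std::locale constructor throws std::runtime_error for a name the system does not recognise, so the sketch falls back to the classic "C" locale in that case.

```cpp
#include <algorithm>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: return the named locale, or the classic "C" locale
// if the system does not provide a locale with that name.
std::locale locale_or_classic(const char* name) {
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Hypothetical helper: sort in place using the collation rules of loc.
// std::locale::operator() compares two strings with loc's collate facet,
// so the locale object itself can be passed as the comparator.
void sort_with_locale(std::vector<std::string>& v, const std::locale& loc) {
    std::sort(v.begin(), v.end(), loc);
}
```

Calling sort_with_locale(v, locale_or_classic("en_US.UTF-8")) should reproduce the d, e, é, f order shown above on a system where that locale is installed. With the fallback, the order is whatever the "C" collation of the implementation gives.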

Answer 2

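If you don't want a lexicographic sort (which is what sorting UTF-8 encoded strings byte by byte will give you), then you will need to decode your UTF-8 encoded strings to UCS-2 or UCS-4 as appropriate, and apply a suitable comparison function of your choosing.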

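To reiterate the point, the UTF-8 encoding mechanism is cleverly designed so that if you sort by looking at the numeric value of each 8-bit encoded byte, you get the same result as if you had first decoded the strings to Unicode and compared the numeric value of each code point.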
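
The equivalence between byte order and code point order can be checked with a small decoder. This is a minimal sketch that assumes well-formed UTF-8 and does no validation. The names decode_utf8 and less_by_code_point are illustrative, not library functions.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into code points. Assumes well-formed input: malformed or
// truncated sequences are not detected and produce meaningless code points
// (the loop never reads past the end of the string).
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (; extra > 0 && i < s.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}

// Compare two UTF-8 strings by the numeric value of their code points.
bool less_by_code_point(const std::string& a, const std::string& b) {
    return decode_utf8(a) < decode_utf8(b);
}
```

For well-formed input, less_by_code_point(a, b) should agree with a < b on the raw std::strings. Note that both orders still place é (U+00E9) after z (U+007A), so decoding alone does not give a language-aware order; it only provides the code points that a more elaborate comparison function would work on.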

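Update: your updated question indicates that you want a more complex comparison function than a purely lexicographic sort. You will need to decode your UTF-8 strings and compare the decoded characters.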

Answer 3

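The encoding (UTF-8, UTF-16, etc.) is not the problem. What matters is whether the container itself treats the string as a Unicode string or as an 8-bit (ASCII or Latin-1) string.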

I found "Is there an STL and UTF-8 friendly C++ Wrapper for ICU, or other powerful Unicode library", which may help you.

Answer 4

One option is to use the ICU collators (http://userguide.icu-project.org/collation/api), which provide a properly internationalized "compare" method that you can then use for sorting.

Chromium has a small wrapper around it that should be easy to copy and paste or reuse:

https://code.google.com/p/chromium/codesearch#chromium/src/base/i18n/string_compare.cc&sq=package:chromium&type=cs

Sorting UTF-8 strings?
