Question

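I'm a bit confused: I already opened a question about this, and I want to be more specific here.

I have many files containing German letters, mostly encoded as iso-8859-15 or UTF-8. To process them, all letters must be converted to lowercase.

For example, I have a file (encoded as iso-8859-15) that contains: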

Dr. Rose in M. Das sogen. Baptisterium zu Winland, eins der im Art. "Baukunst" (S. 496) erwähnten Rundgebäude in Grönland, soll nach Palfreys "History of New England" eine von dem Gouverneur Arnold um 1670 erbaute Windmühle sein. Vgl. Gust. Storm in den "Jahrbüchern der königlichen Gesellschaft für nordische Altertumskunde in Kopenhagen" 1887, S. 296.

ÄäÖöÜüẞßÖrebro

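The text Ää Öö Üü ẞß Örebro should become: ää öö üü ßß örebro.

However, tolower() does not seem to work for uppercase letters such as Ä, Ö, Ü, ẞ, even though I tried forcing the locale as mentioned in this SO post.

Here is the same code I posted in the other question: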

std::vector<std::string> tokens;
std::string filename = "10223-8.txt";
//std::string filename = "test-UTF8.txt";
std::ifstream inFile;

//std::setlocale(LC_ALL, "en_US.iso88591");
//std::setlocale(LC_ALL, "de_DE.iso88591");
//std::setlocale(LC_ALL, "en_US.iso88591");
//std::locale::global(std::locale(""));

inFile.open(filename);
if (!inFile) { std::cerr << "Failed to open file" << std::endl; exit(1); }

std::string s = "";
std::string line;
while( (inFile.good()) && std::getline(inFile, line) ) {
    s.append(line + "\n");
}
inFile.close();

std::cout << s << std::endl;

//std::setlocale(LC_ALL, "de_DE.iso88591");
for (unsigned int i = 0; i < s.length(); ++i) {
    if (std::ispunct(s[i]) || std::isdigit(s[i]))
            s[i] = ' ';
    if (std::isupper(s[i]))
            s[i] = std::tolower(s[i]);
            //s[i] = std::tolower(s[i]);
            //s[i] = std::tolower(s[i], std::locale("de_DE.utf8"))
}

std::cout << s << std::endl;

//tokenize string
std::istringstream iss(s);
tokens.clear();
tokens = {std::istream_iterator<std::string>{iss}, std::istream_iterator<std::string>{}};

//PROCESS TOKENS...

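It's very frustrating, and there aren't many examples of how to use <locale>.

So, apart from the main problem with my code, here are some questions:

Do I also have to apply some custom locale in the other functions (isupper(), ispunct(), ...)?
Do I need to enable or install the de_DE locale in my Linux environment in order to process the string's characters correctly?
Is it safe to process text in a std::string the same way when it was read from files with different encodings (iso-8859-15 or UTF-8)?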

How to apply functions on text files with different encoding in c++

Answer 1

Use std::ctype::tolower instead of std::tolower:

#include <iostream>
#include <locale>
#include <string>

int main() {
    std::locale::global(std::locale("de_DE.UTF-8"));
    std::wcout.imbue(std::locale());
    auto& f = std::use_facet<std::ctype<wchar_t>>(std::locale());
    std::wstring str = L"Ää Öö Üü ẞß Örebro";
    f.tolower(&str[0], &str[0] + str.size());
    std::wcout << "'" << str << "'\n";
}

Instead of setting the global locale, you can create a local locale (heh):

std::locale loc("de_DE.UTF-8");
std::wcout.imbue(loc);
auto& f = std::use_facet<std::ctype<wchar_t>>(loc);

This compiles and "works". On my system it converts the umlauts correctly, but it fails to handle the uppercase ẞ (unsurprisingly, to be honest).

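Also, note the limitation of this function: it can only perform 1-to-1 character conversions. In earlier versions of the Unicode standard, the correct uppercase form of "ß" was "SS"; std::ctype::toupper has clearly never supported that.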
