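I am trying to read a UTF-8 encoded file into a UTF-32 (UCS-4) string. Basically, I want to use fixed-size characters internally within the application.
Here I want to make sure the translation is done as part of stream processing (since that is what the Locale is supposed to be used for). Alternative questions have been posted that do the translation on a string (but that is wasteful, as you have to do a translation phase in memory and then a second pass to send it to the stream). By using the locale in the stream you only need a single pass, and no copy is required (assuming you want to keep the original).
This is what I have tried.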
#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>
int main()
{
    std::locale converter(std::locale(), new std::codecvt_utf8<char32_t>);
    std::basic_ifstream<char32_t> iFile;
    iFile.imbue(converter);
    iFile.open("test.data");
    std::u32string line;
    while(std::getline(iFile, line))
    {
    }
}
Since these are all standard types, I am surprised by this compile error:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/istream:275:41:
error: no matching function for call to 'use_facet'
const ctype<_CharT>& __ct = use_facet<ctype<_CharT> >(__is.getloc());
^~~~~~~~~~~~~~~~~~~~~~~~~
Compiled with:
g++ -std=c++14 test.cpp
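Answer 0 (score: 1)
It seems char32_t is not what I want. Simply switching to wchar_t worked for me. I suspect this only behaves the way I want on Linux-like systems, and that on Windows the conversion will produce UTF-16 (UCS-2) instead (but I cannot test that).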
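The platform caveat above comes down to the width of wchar_t, which is commonly 32 bits on Linux and macOS and 16 bits on Windows. A minimal sketch of how that could be checked at compile time follows; the helper names `wide_encoding_name` and `wchar_is_fixed_width` are hypothetical and not part of any library, and the labels simply mirror the ones used in this answer.

```cpp
#include <string>

// Hypothetical helper: names the encoding a std::wstring can effectively hold
// on the current platform, judged only by the width of wchar_t.
constexpr const char* wide_encoding_name()
{
    return sizeof(wchar_t) >= 4 ? "UTF-32 (UCS-4)" : "UTF-16 (UCS-2)";
}

// Hypothetical helper: true when a single wchar_t is wide enough to hold
// any Unicode code point (up to U+10FFFF).
constexpr bool wchar_is_fixed_width()
{
    return sizeof(wchar_t) >= 4;
}
```

A `static_assert(wchar_is_fixed_width(), "wchar_t is too narrow for UTF-32");` near the top of the program would turn the assumption into a build error on platforms where it does not hold.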
#include <algorithm>
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
int main()
{
    std::locale utf8_to_utf32(std::locale(), new std::codecvt_utf8<wchar_t>);
    // Input stream reads UTF-8 and converts to UTF-32 (UCS-4) String
    std::wifstream iFile("test.data");
    iFile.imbue(utf8_to_utf32);
    // Output UTF-32 (UCS-4) string converts to UTF-8 stream
    std::wofstream oFile("test.res");
    oFile.imbue(utf8_to_utf32);
    // Now just read like you would normally.
    std::wstring line;
    while(std::getline(iFile, line))
    {
        // UTF-32 characters are fixed size.
        // So reverse is simple just do it in-place.
        std::reverse(std::begin(line), std::end(line));
        // UTF-32 unfortunately also has grapheme clusters (these are groups of characters
        // that are displayed as a single glyph). By doing the reverse above we have split
        // these incorrectly. We need to do a second pass to reverse the characters inside
        // each cluster. This is beyond the scope of this question and left as an exercise
        // (but I may come back to it later).
        oFile << line << "\n";
    }
}
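The second pass mentioned in the comments could look roughly like the sketch below. It is a deliberately simplified illustration, not a full grapheme cluster implementation: it assumes that only code points in U+0300..U+036F (the Combining Diacritical Marks block) attach to the preceding character, and it uses std::u32string rather than std::wstring. The names `is_combining_mark` and `reverse_keep_marks` are hypothetical.

```cpp
#include <algorithm>
#include <string>

// Assumption: only U+0300..U+036F (Combining Diacritical Marks) are treated as
// "attached to the previous character". Real grapheme clusters cover far more.
static bool is_combining_mark(char32_t c)
{
    return c >= 0x0300 && c <= 0x036F;
}

// Hypothetical helper: reverse a UTF-32 string while keeping each base
// character followed by its combining marks in their original order.
std::u32string reverse_keep_marks(std::u32string s)
{
    // First pass: plain reversal. Marks now come before their base character.
    std::reverse(s.begin(), s.end());

    // Second pass: flip each run "marks..., base" back to "base, marks...".
    for (auto p = s.begin(), end = s.end(); p != end; ) {
        auto q = p;
        p = std::find_if(p, end, [](char32_t c) { return !is_combining_mark(c); });
        if (p != end) ++p;   // include the base character in the run
        std::reverse(q, p);
    }
    return s;
}
```

For example, the sequence 'a', 'e', U+0301, 'b' comes back as 'b', 'e', U+0301, 'a', so the accent stays on the 'e'. Handling every kind of cluster properly would need the Unicode segmentation rules, typically through a library such as ICU.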
Comments above suggested that this would be slower than reading the data and then translating it inline. So I ran some tests:
// read1.cpp: translation done in the stream, using codecvt and Locale
#include <algorithm>
#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>
#include <string>
int main()
{
    std::locale utf8_to_utf32(std::locale(), new std::codecvt_utf8<wchar_t>);
    std::wifstream iFile("test.data");
    iFile.imbue(utf8_to_utf32);
    std::wofstream oFile("test.res1");
    oFile.imbue(utf8_to_utf32);
    std::wstring line;
    while(std::getline(iFile, line))
    {
        std::reverse(std::begin(line), std::end(line));
        oFile << line << "\n";
    }
}
// read2.cpp: translation done after reading, using codecvt
#include <algorithm>
#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>
#include <string>
int main()
{
    std::ifstream iFile("test.data");
    std::ofstream oFile("test.res2");
    std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_to_utf32;
    std::string line;
    std::wstring wideline;
    while(std::getline(iFile, line))
    {
        wideline = utf8_to_utf32.from_bytes(line);
        std::reverse(std::begin(wideline), std::end(wideline));
        oFile << utf8_to_utf32.to_bytes(wideline) << "\n";
    }
}
// read3.cpp: working directly on UTF-8
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <fstream>
// A byte starts a UTF-8 sequence unless it is a continuation byte (10xxxxxx).
static bool is_lead(std::uint8_t ch) { return ch < 0x80 || ch >= 0xc0; }
/* Reverse a utf-8 string in-place */
void reverse_utf8(std::string& s) {
    std::reverse(s.begin(), s.end());
    for (auto p = s.begin(), end = s.end(); p != end; ) {
        auto q = p;
        p = std::find_if(p, end, is_lead);
        // Malformed input may have no lead byte left: do not step past end.
        if (p != end) ++p;
        std::reverse(q, p);
    }
}
int main(int argc, char** argv)
{
    std::ifstream iFile("test.data");
    std::ofstream oFile("test.res3");
    std::string line;
    while(std::getline(iFile, line))
    {
        reverse_utf8(line);
        oFile << line << "\n";
    }
    return 0;
}
The test file is 58M of Unicode Japanese text.
> ls -lah test.data
-rw-r--r-- 1 loki staff 58M Jan 28 11:28 test.data
> g++ -O3 -std=c++14 read1.cpp -o a1
> g++ -O3 -std=c++14 read2.cpp -o a2
> g++ -O3 -std=c++14 read3.cpp -o a3
>
> # This is the one using Locale in stream
> time ./a1
real 0m0.645s
user 0m0.521s
sys 0m0.108s
>
> # This is the one doing translation after reading.
> time ./a2
real 0m1.058s
user 0m0.916s
sys 0m0.123s
>
> # This is the one using UTF-8
> time ./a3
real 0m0.785s
user 0m0.663s
sys 0m0.104s
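Doing the conversion in the stream is faster, but not by much (and this is not a lot of data). So pick whichever version is easiest to read.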
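If char32_t strings are specifically required, as in the original question, the convert-after-reading approach of read2.cpp can probably be adapted, because std::wstring_convert accepts char32_t as its element type. The following is only a sketch under that assumption: the helper names `utf8_to_u32` and `u32_to_utf8` are hypothetical, it was not part of the timings above, and both std::wstring_convert and <codecvt> are deprecated since C++17.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into a char32_t (UTF-32) string.
// from_bytes throws std::range_error if the input is not valid UTF-8.
std::u32string utf8_to_u32(const std::string& bytes)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(bytes);
}

// Hypothetical helper: encode a char32_t (UTF-32) string back into UTF-8 bytes.
std::string u32_to_utf8(const std::u32string& text)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(text);
}
```

Each line read with a plain std::ifstream would be passed through `utf8_to_u32`, processed, and written back through `u32_to_utf8`, which costs the same extra pass that read2.cpp pays.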