I have a txt file with the following content:
\u041f\u0435\u0440\u0432\u044b\u0439_\u0438\u043d\u0442\u0435\u0440\u0430\u043a\u0442\u0438\u0432\u043d\u044b\u0439_\u0438\u043d\u0442\u0435\u0440\u043d\u0435\u0442_\u043a\u0430\u043d\u0430\u043b
How can I read such a file so that I get this result:
"Первый_интерактивный_интернет_канал"
If I type:
string str = _T("\u041f\u0435\u0440\u0432\u044b\u0439_\u0438\u043d\u0442\u0435\u0440\u0430\u043a\u0442\u0438\u0432\u043d\u044b\u0439_\u0438\u043d\u0442\u0435\u0440\u043d\u0435\u0442_\u043a\u0430\u043d\u0430\u043b");
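then the result in str is fine, but if I read the text from the file, I get exactly what is in the file. I guess this is because '\u' becomes '\\u'.
Is there a simple way to convert \uxxxx notation to the corresponding characters in C++?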
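Answer 0: (score: 2)
This is not easy to do while you are reading the file. A post-processing step afterwards is easier. You can use Boost::regex to find the pattern "\\u[0-9A-Fa-f]{4}" (a literal backslash, 'u', and four hex digits) and replace each match with the corresponding single character.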
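The post-processing step described above can be sketched as follows. This is a minimal sketch, not the answerer's code: it swaps in std::regex from C++11 for Boost, it assumes the result should be a UTF-8 std::string, and the names append_utf8_bmp and decode_unicode_escapes are hypothetical. Like the pattern itself, it only handles code points in the Basic Multilingual Plane.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: append one BMP code point (U+0000..U+FFFF) to out as UTF-8.
// Surrogate values are not treated specially in this sketch.
void append_utf8_bmp(std::string& out, unsigned int cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 1 byte
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical function: replace every \uXXXX in source with the UTF-8 bytes
// of that code point; all other text is copied unchanged.
std::string decode_unicode_escapes(const std::string& source) {
    static const std::regex pattern(R"(\\u([0-9A-Fa-f]{4}))");
    std::string result;
    std::string::const_iterator last = source.begin();
    for (std::sregex_iterator it(source.begin(), source.end(), pattern), end;
         it != end; ++it) {
        const std::smatch& match = *it;
        result.append(last, match[0].first);                   // text before the escape
        const unsigned long value = std::stoul(match[1].str(), nullptr, 16);
        append_utf8_bmp(result, static_cast<unsigned int>(value));
        last = match[0].second;
    }
    result.append(last, source.end());                         // text after the last escape
    return result;
}
```

Calling decode_unicode_escapes on the contents of the file from the question should give the Russian text as UTF-8 bytes, which can then be written to a UTF-8 file or converted to a wide string as needed.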
Answer 1: (score: 2)
Here is an example of what MSalters suggested:
#include <iostream>
#include <string>
#include <fstream>
#include <algorithm>
#include <sstream>
#include <iomanip>
#include <locale>
#include <boost/scoped_array.hpp>
#include <boost/regex.hpp>
#include <boost/numeric/conversion/cast.hpp>

// Replaces every \uXXXX escape in source with the corresponding wide character.
std::wstring convert_unicode_escape_sequences(const std::string& source) {
    const boost::regex regex("\\\\u([0-9A-Fa-f]{4})"); // NB: no support for non-BMP characters
    // The output never has more characters than the input, so source.size() is enough.
    boost::scoped_array<wchar_t> buffer(new wchar_t[source.size()]);
    wchar_t* const output_begin = buffer.get();
    wchar_t* output_iter = output_begin;
    std::string::const_iterator last_match = source.begin();
    for (boost::sregex_iterator input_iter(source.begin(), source.end(), regex), input_end; input_iter != input_end; ++input_iter) {
        const boost::smatch& match = *input_iter;
        // Copy the text between the previous escape and this one unchanged.
        output_iter = std::copy(match.prefix().first, match.prefix().second, output_iter);
        // Parse the four hex digits captured by the first group.
        std::stringstream stream;
        stream << std::hex << match[1].str() << std::ends;
        unsigned int value;
        stream >> value;
        *output_iter++ = boost::numeric_cast<wchar_t>(value);
        last_match = match[0].second;
    }
    // Copy whatever follows the last escape.
    output_iter = std::copy(last_match, source.end(), output_iter);
    return std::wstring(output_begin, output_iter);
}
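int wmain() { // wmain is the MSVC-specific wide-character entry point
    std::locale::global(std::locale(""));
    const std::wstring filename = L"test.txt";
    // Opening a stream with a wide-character filename is an MSVC extension.
    std::ifstream stream(filename.c_str(), std::ios::in | std::ios::binary);
    if (!stream) {
        std::wcerr << L"cannot open " << filename << std::endl;
        return 1;
    }
    // Read the whole file into memory.
    stream.seekg(0, std::ios::end);
    const std::size_t size = static_cast<std::size_t>(stream.tellg());
    stream.seekg(0);
    boost::scoped_array<char> buffer(new char[size]);
    stream.read(buffer.get(), static_cast<std::streamsize>(size));
    const std::string source(buffer.get(), size);
    const std::wstring result = convert_unicode_escape_sequences(source);
    std::wcout << result << std::endl;
    return 0;
}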
It always amazes me how complicated seemingly simple things like this are in C++.
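The example above notes that it has no support for non-BMP characters. In \uXXXX notation such characters are usually written as two consecutive escapes, a high surrogate (D800..DBFF) followed by a low surrogate (DC00..DFFF). The following is a minimal sketch of how a pair could be joined into one code point. It is not part of the answer: the names parse_hex4 and decode_escapes_utf32 are hypothetical, it uses a hand-written scanner instead of a regex, it returns UTF-32, and it assumes the text outside the escapes is plain ASCII.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: parse four hex digits starting at source[pos].
// Returns false if the input is too short or contains a non-hex character.
bool parse_hex4(const std::string& source, std::size_t pos, unsigned int& value) {
    if (pos + 4 > source.size()) return false;
    value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        const char c = source[pos + i];
        unsigned int digit;
        if (c >= '0' && c <= '9') digit = c - '0';
        else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else return false;
        value = (value << 4) | digit;
    }
    return true;
}

// Hypothetical function: decode \uXXXX escapes into UTF-32, joining a
// high/low surrogate pair into a single code point above U+FFFF.
std::u32string decode_escapes_utf32(const std::string& source) {
    std::u32string out;
    std::size_t i = 0;
    while (i < source.size()) {
        unsigned int unit = 0;
        if (source[i] == '\\' && i + 1 < source.size() && source[i + 1] == 'u'
            && parse_hex4(source, i + 2, unit)) {
            i += 6;  // consume "\uXXXX"
            unsigned int low = 0;
            if (unit >= 0xD800 && unit <= 0xDBFF
                && i + 1 < source.size() && source[i] == '\\' && source[i + 1] == 'u'
                && parse_hex4(source, i + 2, low) && low >= 0xDC00 && low <= 0xDFFF) {
                // Combine the pair: 10 bits from each half, offset by 0x10000.
                unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
                i += 6;  // consume the low surrogate as well
            }
            out += static_cast<char32_t>(unit);
        } else {
            // Anything else is copied byte by byte (assumes ASCII input).
            out += static_cast<char32_t>(static_cast<unsigned char>(source[i]));
            ++i;
        }
    }
    return out;
}
```

An unpaired surrogate is passed through unchanged in this sketch; a stricter version might reject it or substitute U+FFFD instead.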
Answer 2: (score: 0)
My solution. I use Boost for the UTF-16 to UTF-8 conversion.
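#include <cstdint>
#include <fstream>
#include <iterator>
#include <locale>
#include <sstream>
#include <string>
#include <codecvt> // note: <codecvt> and std::wstring_convert are deprecated since C++17
#include <boost/numeric/conversion/cast.hpp>
//------------------------------------------------------------------------------
// Combines two ASCII hex digits (high and low nibble) into one byte.
// A character that is not a hex digit contributes 0.
inline uint8_t get_uint8(uint8_t h, uint8_t l)
{
    uint8_t ret = 0; // was left uninitialized when h was not a hex digit
    if (h >= '0' && h <= '9')
        ret = h - '0';
    else if (h >= 'A' && h <= 'F')
        ret = h - 'A' + 0x0A;
    else if (h >= 'a' && h <= 'f')
        ret = h - 'a' + 0x0A;
    ret = ret << 4;
    if (l >= '0' && l <= '9')
        ret |= l - '0';
    else if (l >= 'A' && l <= 'F')
        ret |= l - 'A' + 0x0A;
    else if (l >= 'a' && l <= 'f')
        ret |= l - 'a' + 0x0A;
    return ret;
}
// Replaces every \uXXXX escape in source and returns the result as UTF-8.
std::string convert_unicode_escape_sequences(const std::string& source)
{
    std::wstringstream wis;
    auto s = source.begin();
    while (s != source.end())
    {
        if (*s == '\\')
        {
            // An escape needs six characters: backslash, 'u', four hex digits.
            if (std::distance(s, source.end()) > 5)
            {
                if (*(s + 1) == 'u')
                {
                    unsigned int v = get_uint8(*(s + 2), *(s + 3)) << 8;
                    v |= get_uint8(*(s + 4), *(s + 5));
                    s += 6;
                    wis << boost::numeric_cast<wchar_t>(v);
                    continue;
                }
            }
        }
        // Any other character is copied as is; go through unsigned char
        // so that bytes above 0x7F do not become negative values.
        wis << wchar_t(static_cast<unsigned char>(*s));
        s++;
    }
    std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
    return myconv.to_bytes(wis.str());
}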
Answer 3: (score: -1)
Check this code :) The Windows SDK already has this covered, the MS geeks thought about it too. You can find more details in this post: http://weblogs.asp.net/kennykerr/archive/2008/07/24/visual-c-in-short-converting-between-unicode-and-utf-8.aspx
#include <atlconv.h>
#include <atlstr.h>
#define ASSERT ATLASSERT
int main()
{
    const CStringW unicode1 = L"\u041f and \x03A9"; // Cyrillic 'Pe' (U+041F) and Greek 'Omega' (U+03A9)
    // Convert the wide (UTF-16) string to UTF-8.
    const CStringA utf8 = CW2A(unicode1, CP_UTF8);
    ASSERT(utf8.GetLength() > unicode1.GetLength());
    // Convert back; the round trip must reproduce the original string.
    const CStringW unicode2 = CA2W(utf8, CP_UTF8);
    ASSERT(unicode1 == unicode2);
    return 0;
}
I have tested this code and it works well.