Writing the UTF-8 representation of a Unicode code point to a file

Asked: 2015-04-06 15:17:52

Tags: c++ unicode encoding utf-8

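I have a proprietary file (database) format that I am currently trying to migrate to an SQL database. To do that I convert the files into SQL dumps, which already works fine. The only remaining problem is the odd way the format handles characters outside the ASCII decimal range 32 to 126. It keeps a collection of all those characters, stored as Unicode code points in hex (for example 20AC = €) and indexed by its own internal index.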

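My plan now is to create a table that stores the internal index, the Unicode code point (hex), and the character itself (UTF-8). This table can then be used for future updates.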

Now the problem: how do I write the UTF-8 character representation of a Unicode hex value to a file? The current code looks like this:

this->outFile.open(fileName + ".sql", std::ofstream::app);
std::string protyp;
this->inFile.ignore(2); // Ignore the ID = 01.
std::getline(this->inFile, protyp); // Get the PROTYP Identifier (e.g. \321)
protyp = "\\" + protyp;

std::string unicodeHex;
this->inFile.ignore(2); // Ignore the ID = 01.
std::getline(this->inFile, unicodeHex); // Get the Unicode HEX Identifier (e.g. 002C)

std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
const std::wstring wide_string = this->s2ws("\\u" + unicodeHex);
const std::string utf8_rep = converter.to_bytes(wide_string);

std::string valueString = "('" + protyp + "', '" + unicodeHex + "', '" + utf8_rep + "')";

this->outFile << valueString << std::endl;

this->outFile.close();

But this just prints something like this:

('\321', '002C', '\u002C'),

whereas the desired output is:

('\321', '002C', ','),

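What am I doing wrong? I have to admit that I am not very confident when it comes to character encodings :/. I am on Windows 7 64-bit, if that makes any difference. Thanks in advance.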

1 Answer:

Answer 0 (score: 1)

As @Mark Ransom pointed out in the comments, my best bet was to convert the hex string to an integer and work with that. This is what I did:

unsigned int decimalHex = std::stoul(unicodeHex, nullptr, 16);

std::string valueString = "('" + protyp + "', '" + unicodeHex + "', '" + this->UnicodeToUTF8(decimalHex) + "')";
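
For anyone wondering why the original attempt printed \u002C literally: as far as I understand it, a \uXXXX escape is only translated by the compiler when it appears inside a literal in the source code. A string assembled at run time from "\\u" and the hex digits is just six ordinary characters, and no conversion step later will reinterpret them (s2ws is not shown in the question, so I am assuming it simply widens each character). The sketch below shows the difference; buildEscapeText and parseCodepoint are made-up names for illustration only.

```cpp
#include <string>

// Hypothetical helper: builds the same text the question's code fed to the converter.
std::string buildEscapeText(const std::string& unicodeHex)
{
    // "\\u" is a backslash followed by the letter 'u'. Nothing interprets
    // this as an escape sequence at run time, so "002C" yields the six
    // characters \ u 0 0 2 C.
    return "\\u" + unicodeHex;
}

// Hypothetical helper: parses the hex digits into the numeric code point,
// which is what an encoder such as UnicodeToUTF8 actually needs.
unsigned long parseCodepoint(const std::string& unicodeHex)
{
    return std::stoul(unicodeHex, nullptr, 16);
}
```

With these, buildEscapeText("002C") has length 6, while parseCodepoint("002C") is the number 0x2C, the code point of the comma.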

The UnicodeToUTF8 function is taken from "Unsigned integer as UTF-8 value":

std::string UnicodeToUTF8(unsigned int codepoint)
{
    std::string out;

    if (codepoint <= 0x7f)
        out.append(1, static_cast<char>(codepoint));
    else if (codepoint <= 0x7ff)
    {
        out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else if (codepoint <= 0xffff)
    {
        out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else
    {
        out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    return out;
}
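
One thing the function above does not do is validate its input: values above 0x10FFFF or in the surrogate range 0xD800 to 0xDFFF are not valid Unicode scalar values, yet they would still be encoded. If the source data cannot be fully trusted, a checked variant might look like the sketch below. hexToUtf8 is a hypothetical name, not part of the original answer, and throwing exceptions is just one possible way to report bad input.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: parses hex digits such as "20AC" and returns the
// UTF-8 bytes, rejecting input that is not a Unicode scalar value.
std::string hexToUtf8(const std::string& unicodeHex)
{
    std::size_t consumed = 0;
    // std::stoul throws std::invalid_argument if no digits can be parsed.
    const unsigned long cp = std::stoul(unicodeHex, &consumed, 16);
    if (consumed != unicodeHex.size())
        throw std::invalid_argument("trailing non-hex characters: " + unicodeHex);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::out_of_range("not a Unicode scalar value: " + unicodeHex);

    std::string out;
    if (cp <= 0x7F) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, hexToUtf8("002C") returns "," and hexToUtf8("20AC") returns the three bytes E2 82 AC, which is € in UTF-8. Note that the trailing-character check also rejects a stray carriage return left at the end of a line, which may or may not be what you want.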