Converting a C++ string to a valid UTF-8 string using utf8proc

Time: 2012-10-24 11:03:47

Tags: c++ string utf-8

I have a std::string named output. Using utf8proc, I want to convert it to a valid UTF-8 string. http://www.public-software-group.org/utf8proc-documentation

typedef int int32_t;
#define ssize_t int
ssize_t utf8proc_reencode(int32_t *buffer, ssize_t length, int options)
Reencodes the sequence of unicode characters given by the pointer buffer and length as UTF-8. The result is stored in the same memory area where the data is read. Following flags in the options field are regarded: (Documentation missing here) In case of success the length of the resulting UTF-8 string is returned, otherwise a negative error code is returned.
WARNING: The amount of free space being pointed to by buffer, has to exceed the amount of the input data by one byte, and the entries of the array pointed to by str have to be in the range of 0x0000 to 0x10FFFF, otherwise the program might crash!

First, how do I add the extra byte at the end? And second, how do I convert from std::string to int32_t *buffer?

This does not work:

std::string g = output();
fprintf(stdout,"str: %s\n",g.c_str());
g += " ";   //add an extra byte?? 
g = utf8proc_reencode((int*)g.c_str(), g.size()-1, 0);
fprintf(stdout,"strutf8: %s\n",g.c_str());  

1 answer:

Answer 0: (score: 0)

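You most likely don't actually want utf8proc_reencode(). That function takes a valid UTF-32 buffer (one int32_t per code point) and converts it into a valid UTF-8 buffer, but since you say you don't know what encoding your data is in at that point, you can't use it. Casting g.c_str() to int* converts nothing; it only reinterprets the existing bytes.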
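To make the UTF-32 to UTF-8 step concrete, here is a minimal, library-free sketch of that kind of re-encoding. It is not utf8proc's code: encode_utf8 is a hypothetical name, and the input is assumed to already be a sequence of Unicode code points, which is exactly what a std::string of unknown encoding is not.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of utf8proc): encode Unicode code points
// (UTF-32) as UTF-8. Returns false if a code point is a surrogate or is
// greater than 0x10FFFF; `out` may then hold a partial result.
bool encode_utf8(const std::vector<std::uint32_t>& code_points, std::string& out) {
    out.clear();
    for (std::uint32_t cp : code_points) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;  // not a valid Unicode scalar value
        }
        if (cp < 0x80) {            // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return true;
}
```

The point of the sketch is that the input is an array of whole code points, so the raw bytes of a std::string cannot simply be cast into it.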

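First, you need to determine how your data is actually encoded. You can use http://utfcpp.sourceforge.net/ to test whether you already have valid UTF-8, with utf8::is_valid(g.begin(), g.end()). If that returns true, you're done!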
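For illustration, a hand-written check in the same spirit might look like the sketch below. This is not utfcpp's implementation; looks_like_valid_utf8 is a hypothetical name, and the rules it applies (no overlong forms, no surrogates, nothing above 0x10FFFF) are the usual UTF-8 well-formedness rules.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from utfcpp): returns true if `s` is well-formed UTF-8.
bool looks_like_valid_utf8(const std::string& s) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = p[i];
        std::size_t extra = 0;     // number of continuation bytes expected
        unsigned long cp = 0;      // decoded code point
        unsigned long min_cp = 0;  // smallest code point allowed for this length
        if (lead < 0x80) {
            ++i;                   // plain ASCII
            continue;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F; min_cp = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; min_cp = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07; min_cp = 0x10000;
        } else {
            return false;          // stray continuation byte or invalid lead byte
        }
        if (n - i - 1 < extra) {
            return false;          // sequence is truncated
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = p[i + k];
            if ((b & 0xC0) != 0x80) {
                return false;      // not a continuation byte
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += extra + 1;
    }
    return true;
}
```

A lone ISO-8859-1 byte such as 0xE9 fails this check, which is how legacy-encoded input typically shows up. Passing the check does not prove the data was meant to be UTF-8; pure ASCII passes too.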

If it returns false, things get complicated... but ICU (http://icu-project.org/) can help you; see http://userguide.icu-project.org/conversion/detection

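Once you know, with some degree of reliability, what encoding your data is in, ICU can help again with getting it to UTF-8. For example, assuming your source data g is in ISO-8859-1: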

#include <unicode/ucnv.h> // ICU converter API
#include <vector>

UErrorCode err = U_ZERO_ERROR; // check this after every call...
// Convert from ISO-8859-1 to UChar (UTF-16)
UConverter *conv_from = ucnv_open("ISO-8859-1", &err);
std::vector<UChar> converted(g.size()*2 + 1); // more than enough; +1 keeps &converted[0] valid when g is empty
int32_t conv_len = ucnv_toUChars(conv_from, &converted[0], converted.size(), g.c_str(), g.size(), &err);
ucnv_close(conv_from);
if (U_FAILURE(err)) { /* report the error and stop here */ }
// Convert from UChar to UTF-8
g.resize(conv_len*4 + 1); // generous upper bound on the UTF-8 length
UConverter *conv_u8 = ucnv_open("UTF-8", &err);
int32_t u8_len = ucnv_fromUChars(conv_u8, &g[0], g.size(), &converted[0], conv_len, &err);
ucnv_close(conv_u8);
if (U_FAILURE(err)) { /* report the error and stop here */ }
g.resize(u8_len); // trim to the number of bytes actually written
After this, g holds UTF-8 data.
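If the source really is ISO-8859-1, the conversion is simple enough to sketch without ICU, because every ISO-8859-1 byte value equals its Unicode code point (U+0000 to U+00FF). The helper name latin1_to_utf8 is hypothetical and not part of ICU. The sketch assumes the input is ISO-8859-1; for any other encoding it silently produces wrong output.

```cpp
#include <string>

// Hypothetical helper (not from ICU): convert ISO-8859-1 bytes to UTF-8.
// Assumes the input really is ISO-8859-1; no detection is attempted.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    out.reserve(latin1.size() * 2);  // each input byte becomes at most 2 bytes
    for (char ch : latin1) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            out += static_cast<char>(c);                   // ASCII is unchanged
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));     // lead byte: 110000xx
            out += static_cast<char>(0x80 | (c & 0x3F));   // continuation: 10xxxxxx
        }
    }
    return out;
}
```

With this, g = latin1_to_utf8(g); would replace the two ICU conversion steps for this one encoding. For anything other than ISO-8859-1, ICU's converters remain the more general route.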