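I'm trying to convert a UTF-16 string (obtained from a JSString in SpiderMonkey 19) to a UTF-8 string. I think the conversion itself is fine, but for some reason the conversion routine adds two extra bytes for every Unicode (non-ASCII) character. I'm fairly sure I'm doing something wrong; I've tried different encodings without good results. This is what I'm doing right now: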
// UTF-16 string "áéíóúñ aeiou", this is the string being converted
// (you can find "aeiou" after \x20\x00, where \x61\x00 is "a")
\xC3\x00\xA1\x00\xC3\x00\xA9\x00\xC3\x00\xAD\x00\xC3\x00\xB3\x00\xC3\x00\xBA\x00\xC3\x00\xB1\x00\x20\x00\x61\x00\x65\x00\x69\x00\x6F\x00\x75\x00\x6E\x00
// UTF-8 string, test string, taken from:
// const char* cmp = "áéíóúñ aeiou"
// This is the result I'm looking for.
\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xba\xc3\xb1 aeiou
// UTF-8 string I'm getting after iconv(utf16, utf8)
\xc3\x83\xc2\xa1\xc3\x83\xc2\xa9\xc3\x83\xc2\xad\xc3\x83\xc2\xb3\xc3\x83\xc2\xba\xc3\x83\xc2\xb1 aeioun
As you can see, there are two extra bytes (\x83 and \xc2) for each non-ASCII character. Does anyone know why this happens?
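One way to read the dumps above is to decode the input by hand. The sketch below splits the little-endian bytes into 16-bit code units and encodes each unit as UTF-8. The helper names `code_units_from_le` and `utf8_from_bmp` are made up for illustration, and surrogate pairs are deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: turn little-endian byte pairs into 16-bit code units.
std::vector<std::uint16_t> code_units_from_le(const std::string& bytes) {
    assert(bytes.size() % 2 == 0);  // UTF-16 data must be whole 16-bit units
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        units.push_back(static_cast<std::uint16_t>(lo | (hi << 8)));
    }
    return units;
}

// Hypothetical helper: encode code units from the Basic Multilingual Plane
// as UTF-8. Surrogate pairs are NOT combined here; this is only a sketch.
std::string utf8_from_bmp(const std::vector<std::uint16_t>& units) {
    std::string out;
    for (std::uint16_t u : units) {
        if (u < 0x80) {            // 1 byte: 0xxxxxxx
            out += static_cast<char>(u);
        } else if (u < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (u >> 6));
            out += static_cast<char>(0x80 | (u & 0x3F));
        } else {                   // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (u >> 12));
            out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (u & 0x3F));
        }
    }
    return out;
}
```

For the first two units of the input, 0x00C3 and 0x00A1, this yields \xc3\x83\xc2\xa1, the same bytes iconv produced. That suggests, without proving it, that the UTF-16 data may already hold UTF-8 bytes widened to 16 bits each; á stored as a single UTF-16LE unit would be \xE1\x00.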
This is my conversion routine:
shared_ptr<char> convertToUTF8(char* utf16string, size_t len) {
    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    char* utf8;
    size_t utf8len;
    utf8len = len;
    utf8 = (char *)calloc(utf8len, 1);
    shared_ptr<char> outptr(utf8);
    size_t converted = iconv(cd, &utf16string, &len, &utf8, &utf8len);
    if (converted == (size_t)-1) {
        fprintf(stderr, "iconv failed\n");
        switch (errno) {
        case EILSEQ:
            fprintf(stderr, "Invalid multibyte sequence.\n");
            break;
        case EINVAL:
            fprintf(stderr, "Incomplete multibyte sequence.\n");
            break;
        case E2BIG:
            fprintf(stderr, "No more room (iconv).\n");
            break;
        default:
            fprintf(stderr, "Error: %s.\n", strerror(errno));
            break;
        }
        outptr = NULL;
    }
    iconv_close(cd);
    assert(outptr);
    return outptr;
}
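I also tried the solution from this other question, but the result is exactly the same. Any idea why iconv adds the two extra bytes? How can I make the result match the hand-written UTF-8 string?
EDIT: fixed the description of the test string.
Answer 0 (score: 0)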
Why not use "UTF16" or "UTF-16" instead of "UTF-16LE"? According to 'man iconv_open', there seem to be 8 different UTF-16 encoding names:
UTF-16// UTF-16BE// UTF-16LE// UTF16// UTF16BE// UTF16LE//
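To try different encoding names quickly, the source encoding can be passed as a parameter. The following is only a sketch under a few assumptions: the function name `utf16_bytes_to_utf8` is hypothetical, the platform's `iconv` takes a `char**` input pointer (as glibc does), and the whole input is converted in one call. The output buffer is sized generously, because a 2-byte UTF-16 unit can need up to 3 bytes in UTF-8, so a buffer of the same length as the input may be too small.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert raw UTF-16 bytes to UTF-8 with iconv.
// `fromcode` is the iconv name of the source encoding, e.g. "UTF-16LE".
std::string utf16_bytes_to_utf8(const std::string& input,
                                const char* fromcode = "UTF-16LE") {
    assert(fromcode != nullptr);

    iconv_t cd = iconv_open("UTF-8", fromcode);
    if (cd == (iconv_t)-1) {
        throw std::runtime_error(std::string("iconv_open failed: ") +
                                 std::strerror(errno));
    }

    // Worst case is 3 output bytes per 2 input bytes; 2x plus slack is ample.
    std::string output(input.size() * 2 + 4, '\0');

    char* in = const_cast<char*>(input.data());  // iconv only reads from it
    std::size_t in_left = input.size();
    char* out = &output[0];
    std::size_t out_left = output.size();

    const std::size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    const int saved_errno = errno;  // iconv_close may overwrite errno
    iconv_close(cd);

    if (rc == (std::size_t)-1) {
        throw std::runtime_error(std::string("iconv failed: ") +
                                 std::strerror(saved_errno));
    }

    // Keep only the bytes iconv actually wrote.
    output.resize(output.size() - out_left);
    return output;
}
```

Calling it with a second argument of "UTF-16" instead of the default makes the comparison a one-line change. How a byte order mark, or its absence, is treated under that name depends on the iconv implementation, so the result is worth checking on the target platform.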
I have no experience with iconv, though. I have used the following function to convert a JSString to a gchar*:
gchar* gtweet_jsengine_jsval2gchar(GtweetTwitterClient *self, jsval value)
{
    JSContext *jscontext = NULL;
    JSString *string = NULL;
    GError *error = NULL;
    gunichar2 *utf16_string = NULL;
    gsize utf16_length = 0;
    glong rlen = 0;
    glong wlen = 0;
    gchar *ret = NULL;

    jscontext = self->priv->jscontext;
    JS_BeginRequest(jscontext);
    string = JS_ValueToString(jscontext, value);
    /* The JSString characters are 16-bit units, so they can be passed to GLib as gunichar2 */
    utf16_string = (gunichar2 *) JS_GetStringCharsAndLength(jscontext, string, &utf16_length);
    ret = g_utf16_to_utf8(utf16_string, utf16_length, &rlen, &wlen, &error);
    /* End the request before any return so the error path does not skip it */
    JS_EndRequest(jscontext);
    if (error)
    {
        g_printerr("%s: %d: %s [rlen: %ld wlen: %ld]\n", g_quark_to_string(error->domain), error->code, error->message, rlen, wlen);
        g_clear_error(&error); /* free the GError instead of leaking it */
        return NULL;
    }
    return ret;
}