Question

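I find it hard to believe I'm the first person to run into this problem, but I searched for quite some time and didn't find a solution.

I'd like to use strncpy, but have it be UTF-8 aware so that it doesn't write part of a UTF-8 character into the destination string.

Otherwise you can never be sure that the resulting string is valid UTF-8, even if you know the source is (whenever the source string is longer than the maximum length).

Validating the resulting string can work, but if this is going to be called a lot it would be better to have a strncpy-like function that checks for it.

glib has g_utf8_strncpy, but that copies a given number of Unicode characters, whereas I'm looking for a copy function that limits by byte length.

To be clear, by "UTF-8 aware" I mean that it must not exceed the limit of the destination buffer and must never copy only part of a UTF-8 character. (Given valid UTF-8 input, it must never produce invalid UTF-8 output.)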
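To make the failure mode concrete, here is a minimal sketch of how one might test whether a byte-limited cut lands in the middle of a character. The helper name `cut_splits_character` is hypothetical, and the sketch assumes the source is valid UTF-8 and that `n` is no larger than the string's length.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if cutting `s` after `n` bytes would
 * split a multi-byte UTF-8 character, 0 otherwise.
 * Assumes `s` is valid UTF-8 and n <= strlen(s). */
int cut_splits_character(const char *s, size_t n)
{
    /* Continuation bytes have the bit pattern 10xxxxxx. If the first
     * byte that is left out is a continuation byte, the character it
     * belongs to has been split. */
    return ((unsigned char)s[n] & 0xC0) == 0x80;
}
```

For "h\xc3\xa9llo" (the letter h followed by a two-byte character), a limit of 2 bytes splits the character while limits of 1 or 3 bytes do not.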

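Note:

Some replies have pointed out that strncpy pads the remaining bytes with nulls and that it does not guarantee null termination. In retrospect I should have asked for a UTF-8 aware strlcpy, but at the time I didn't know that function existed.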

Answer 1

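I'm not sure what you mean by UTF-8 aware; strncpy copies bytes, not characters, and the size of the buffer is given in bytes as well. If what you mean is that it should only copy complete UTF-8 characters, stopping, for example, if there isn't room for the next character, I'm not aware of such a function, but it shouldn't be too hard to write: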

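int
utf8Size( char ch )
{
    //  Number of bytes in the sequence introduced by each possible
    //  lead byte, indexed by the byte value.
    static int const sizeTable[] =
    {
        //  ...
    };
    return sizeTable[ static_cast<unsigned char>( ch ) ];
}

//  IllegalUTF8 is an exception type assumed to be defined elsewhere.
char*
stru8ncpy( char* dest, char const* source, int n )
{
    //  Copy whole characters only, always leaving room for the '\0'.
    while ( *source != '\0' && utf8Size( *source ) < n ) {
        n -= utf8Size( *source );
        //  Each case deliberately falls through to the next, so that
        //  exactly utf8Size( *source ) bytes are copied.
        switch ( utf8Size( *source ) ) {
        case 6:
            *dest ++ = *source ++;
        case 5:
            *dest ++ = *source ++;
        case 4:
            *dest ++ = *source ++;
        case 3:
            *dest ++ = *source ++;
        case 2:
            *dest ++ = *source ++;
        case 1:
            *dest ++ = *source ++;
            break;
        default:
            throw IllegalUTF8();
        }
    }
    *dest = '\0';
    return dest;
}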

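(Generating the contents of the table in utf8Size is a bit of a pain, but this is a function you'll use a lot if you're dealing with UTF-8, and you only have to do it once.)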
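As an illustration of how such a table could be produced rather than typed by hand, here is a rough sketch in C. The names `utf8_lead_size` and `utf8_fill_size_table` are hypothetical, and mapping bytes that cannot start a sequence to 0 is an assumption made here (in the code above, a 0 would end up in the `default` branch of the switch).

```c
#include <stddef.h>

/* Hypothetical helper: sequence length implied by a lead byte, using
 * the original scheme of up to 6 bytes. Returns 0 for a byte that
 * cannot start a sequence (an assumption of this sketch). */
int utf8_lead_size(unsigned char c)
{
    if (c < 0x80) return 1;   /* 0xxxxxxx: ASCII               */
    if (c < 0xC0) return 0;   /* 10xxxxxx: continuation byte   */
    if (c < 0xE0) return 2;   /* 110xxxxx                      */
    if (c < 0xF0) return 3;   /* 1110xxxx                      */
    if (c < 0xF8) return 4;   /* 11110xxx                      */
    if (c < 0xFC) return 5;   /* 111110xx                      */
    if (c < 0xFE) return 6;   /* 1111110x                      */
    return 0;                 /* 0xFE and 0xFF never occur     */
}

/* Fills a 256-entry table once, so later lookups are a single index. */
void utf8_fill_size_table(int table[256])
{
    for (int i = 0; i < 256; i++)
        table[i] = utf8_lead_size((unsigned char)i);
}
```

Calling `utf8_fill_size_table` once at start-up, or printing its result and pasting it into the source, would give the same lookup cost as a hand-written table.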

Answer 2

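I tested this on many sample UTF-8 strings with multi-byte characters. If the source is too long, it searches it in reverse (starting at the null terminator) and works backward to find the last complete UTF-8 character that fits into the destination buffer. It always ensures that the destination is null terminated.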

char* utf8cpy(char* dst, const char* src, size_t sizeDest )
{
    if( sizeDest ){
        size_t sizeSrc = strlen(src); // number of bytes not including null
        while( sizeSrc >= sizeDest ){

            const char* lastByte = src + sizeSrc; // Initially, pointing to the null terminator.
            while( lastByte-- > src )
                if((*lastByte & 0xC0) != 0x80) // Found the initial byte of the (potentially) multi-byte character (or found null).
                    break;

            sizeSrc = lastByte - src;
        }
        memcpy(dst, src, sizeSrc);
        dst[sizeSrc] = '\0';
    }
    return dst;
}
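A possible variation on the same idea, sketched here under the assumption that the source is valid UTF-8: instead of measuring the whole source with strlen, look only at the first `dst_size` bytes and back up from the cut position while it points at a continuation byte. The names `utf8_fit_len` and `utf8_copy_bounded` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes of `src` that fit into a buffer
 * of `dst_size` bytes (leaving room for the terminator) without
 * splitting a character. */
size_t utf8_fit_len(const char *src, size_t dst_size)
{
    size_t limit, n = 0;

    if (dst_size == 0)
        return 0;
    limit = dst_size - 1;

    /* Advance to the cut position or to the end of the source. */
    while (n < limit && src[n] != '\0')
        n++;

    /* If the source continues past the cut, back up until the first
     * excluded byte is not a continuation byte (10xxxxxx). */
    if (src[n] != '\0') {
        while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
            n--;
    }
    return n;
}

/* Copies at most dst_size - 1 bytes and always terminates dst. */
char *utf8_copy_bounded(char *dst, const char *src, size_t dst_size)
{
    if (dst_size > 0) {
        size_t n = utf8_fit_len(src, dst_size);
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return dst;
}
```

With the source "h\xc3\xa9llo", a 3-byte destination receives just "h", because the two-byte character would not fit next to the terminator, while a 4-byte destination receives "h\xc3\xa9".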

Answer 3

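strncpy() is a terrible function:

1. If there is insufficient space, the resulting string will not be NUL terminated.
2. If there is enough space, the remainder is filled with NULs. This can be painful if the destination string is very large.

Even if the characters stay in the ASCII range (0x7f and below), the resulting string will not be what you want. In the UTF-8 case it might lack the NUL terminator and also end in an invalid UTF-8 sequence.

The best advice is to avoid strncpy().

EDIT: ad 1):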

#include <stdio.h>
#include <string.h>

int main (void)
{
    char buff [4];

    strncpy (buff, "hello world!\n", sizeof buff );
    printf("%s\n", buff );

    return 0;
}

Agreed, the buffer will not be overrun. But the result is still unwanted. strncpy() solves only part of the problem. It is misleading and unwanted.
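Both behaviors can be observed without printing an unterminated buffer. The following is only a small sketch; the function names are made up for illustration.

```c
#include <string.h>

/* Returns 1 if the first n bytes of buf contain a NUL terminator. */
int is_terminated(const char *buf, size_t n)
{
    return memchr(buf, '\0', n) != NULL;
}

/* Case 1: the source is longer than the buffer, so strncpy() fills
 * all 4 bytes with "hell" and writes no terminator. */
int demo_truncation_is_terminated(void)
{
    char buf[4];

    strncpy(buf, "hello world!\n", sizeof buf);
    return is_terminated(buf, sizeof buf);
}

/* Case 2: the source is shorter than the buffer, so strncpy() pads
 * the whole remainder with NULs. Returns the number of NUL bytes. */
size_t demo_padding_count(void)
{
    char buf[8];
    size_t i, zeros = 0;

    memset(buf, 'x', sizeof buf);
    strncpy(buf, "hi", sizeof buf);
    for (i = 0; i < sizeof buf; i++)
        if (buf[i] == '\0')
            zeros++;
    return zeros;
}
```

The first function reports that no terminator was written, and the second counts six NUL bytes after the two copied characters in the 8-byte buffer.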

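UPDATE (2012-10-31): Since this is a nasty problem, I decided to hack up my own version, mimicking the ugly strncpy() behavior. The return value is the number of characters copied, though..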

#include <stdio.h>
#include <string.h>

size_t utf8ncpy(char *dst, char *src, size_t todo);
static int cnt_utf8(unsigned ch, size_t len);

static int cnt_utf8(unsigned ch, size_t len)
{
    if (!len) return 0;

    if ((ch & 0x80) == 0x00) return 1;
    else if ((ch & 0xe0) == 0xc0) return 2;
    else if ((ch & 0xf0) == 0xe0) return 3;
    else if ((ch & 0xf8) == 0xf0) return 4;
    else if ((ch & 0xfc) == 0xf8) return 5;
    else if ((ch & 0xfe) == 0xfc) return 6;
    else return -1; /* Default (Not in the spec) */
}

size_t utf8ncpy(char *dst, char *src, size_t todo)
{
    size_t done, idx, chunk, srclen;

    srclen = strlen(src);
    for(done=idx=0; idx < srclen; idx+=chunk) {
        int ret = 0; /* initialized so it is never read indeterminate */
        for (chunk=0; done+chunk < todo; chunk++) {
            ret = cnt_utf8( src[idx+chunk], srclen - (idx+chunk) );
            if (ret ==1) continue; /* Normal character: collect it into chunk */
            if (ret < 0) continue; /* Bad stuff: treat as normal char */
            if (ret ==0) break; /* EOF */
            if (!chunk) chunk = ret;/* an UTF8 multibyte character */
            else ret = 1; /* we already collected a number (chunk) of normal characters */
            break;
        }
        if (ret > 1 && done+chunk > todo) break;
        if (done+chunk > todo) chunk = todo - done;
        if (!chunk) break;
        memcpy( dst+done, src+idx, chunk);
        done += chunk;
        if (ret < 1) break;
    }
    /* This is part of the dreaded strncpy() behavior:
    ** pad the destination string with NULs
    ** up to its intended size
    */
    if (done < todo) memset(dst+done, 0, todo-done);
    return done;
}

int main(void)
{
    char *string = "Hell\xc3\xb6 \xf1\x82\x82\x82, world\xc2\xa1!";
    char buffer[30];
    unsigned result, len;

    for (len = sizeof buffer-1; len < sizeof buffer; len -=3) {
        result = utf8ncpy(buffer, string, len);
        /* remove the following line to get the REAL strncpy() behaviour */
        buffer[result] = 0;
        printf("Chop @%u\n", len );
        printf("Org:[%s]\n", string );
        printf("Res:%u\n", result );
        printf("New:[%s]\n", buffer );
    }
    return 0;
}

Answer 4

Here is a C++ solution:

u8string.h:

#ifndef U8STRING_H
#define U8STRING_H 1
#include <stddef.h>
#ifdef __cplusplus
extern "C" {
#endif

/**
 * Copies the first few characters of the UTF-8-encoded string pointed to by
 * \p src into \p dest_buf, as many UTF-8-encoded characters as can be written in
 * <code>dest_buf_len - 1</code> bytes or until the NUL terminator of the string
 * pointed to by \p src is reached.
 *
 * The string of bytes that are written into \p dest_buf is NUL terminated
 * if \p dest_buf_len is greater than 0.
 *
 * \returns \p dest_buf
 */
char * u8slbcpy(char *dest_buf, const char *src, size_t dest_buf_len);

#ifdef __cplusplus
}
#endif
#endif

u8slbcpy.cpp:

#include "u8string.h"

#include <cstring>
#include <utf8.h>

char * u8slbcpy(char *dest_buf, const char *src, size_t dest_buf_len)
{
    if (dest_buf_len <= 0) {
        return dest_buf;
    } else if (dest_buf_len == 1) {
        dest_buf[0] = '\0';
        return dest_buf;
    }

    size_t num_bytes_remaining = dest_buf_len - 1;
    utf8::unchecked::iterator<const char *> it(src);
    const char * prev_base = src;
    while (*it++ != '\0') {
        const char *base = it.base();
        ptrdiff_t diff = (base - prev_base);
        if (num_bytes_remaining < diff) {
            break;
        }
        num_bytes_remaining -= diff;
        prev_base = base;
    }

    size_t n = dest_buf_len - 1 - num_bytes_remaining;
    std::memmove(dest_buf, src, n);
    dest_buf[n] = '\0';

    return dest_buf;
}

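The function u8slbcpy() has a C interface, but it is implemented in C++. My implementation uses the header-only UTF8-CPP library.

I think this is pretty much what you are looking for, but note that one or more combining characters might still fail to be copied. This happens when combining characters apply to the nth character (which is not itself a combining character) and the destination buffer is just large enough to store the UTF-8 encoding of characters 1 through n, but not the combining characters of character n. In that case the bytes representing characters 1 through n are written, but none of the combining characters of n are. In effect, you could say that the nth character is partially written.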
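One way a caller might detect this situation is to check whether the source continues with a combining mark at the cut position. The sketch below is deliberately narrow: the helper name `starts_combining_mark` is hypothetical, and it only recognizes the Combining Diacritical Marks block (U+0300 to U+036F), whereas Unicode defines combining characters in other ranges as well.

```c
/* Hypothetical helper: returns 1 if the NUL-terminated UTF-8 text at
 * `p` starts with a character in U+0300..U+036F, 0 otherwise.
 * In UTF-8 that range is encoded as CC 80..CC BF and CD 80..CD AF. */
int starts_combining_mark(const char *p)
{
    unsigned char b0 = (unsigned char)p[0];
    unsigned char b1;

    if (b0 != 0xCC && b0 != 0xCD)
        return 0;

    /* Safe to read: p[0] is not NUL, so p[1] is at worst the terminator. */
    b1 = (unsigned char)p[1];
    if (b0 == 0xCC)
        return b1 >= 0x80 && b1 <= 0xBF;
    return b1 >= 0x80 && b1 <= 0xAF;
}
```

For the source "e\xcc\x81" (the letter e followed by U+0301, a combining acute accent), a cut after the first byte is on a character boundary, yet `starts_combining_mark(src + 1)` reports that the accent belonging to that letter was left behind. What to do in that case, for example dropping the base character as well, is a policy decision left to the caller.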

Answer 5

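Answering my own question, here is the C function I ended up with (not using C++ for this project):

Notes:
- Realize this is not a clone of strncpy for UTF-8; it is more like strlcpy from OpenBSD.
- utf8_skip_data is copied from glib's gutf8.c.
- It does not validate the UTF-8, which is what I intended.

I hope this is useful to others, and I'm interested in feedback, but please no pedantic zealotry about NULL termination behavior unless it is an actual bug or misleading/incorrect behavior.

Thanks to James Kanze, who provided the basis for this, though it was incomplete and in C++ (I needed a C version).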

static const size_t utf8_skip_data[256] = {
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,6,6,1,1
};

char *strlcpy_utf8(char *dst, const char *src, size_t maxncpy)
{
    char *dst_r = dst;
    size_t utf8_size;

    if (maxncpy > 0) {
        while (*src != '\0' && (utf8_size = utf8_skip_data[*((const unsigned char *)src)]) < maxncpy) {
            maxncpy -= utf8_size;
            switch (utf8_size) {
                case 6: *dst ++ = *src ++;
                case 5: *dst ++ = *src ++;
                case 4: *dst ++ = *src ++;
                case 3: *dst ++ = *src ++;
                case 2: *dst ++ = *src ++;
                case 1: *dst ++ = *src ++;
            }
        }
        *dst= '\0';
    }
    return dst_r;
}

Answer 6

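This is a comment on the answer above that says "strncpy() is a terrible function:". I hate to comment on such blanket statements at the risk of starting yet another internet programming jihad, but I will anyway, since statements like this are misleading to those who might come here looking for answers.

Okay, maybe the C string functions are "old school". Maybe all strings in C/C++ should live in some kind of smart container, and maybe one should use C++ instead of C (when you have the choice). These are more matters of preference and arguments for other topics.

I came here looking for a UTF-8 strncpy() myself. Not that I couldn't write one (the encoding is IMHO simple and elegant), but I wanted to see how others wrote theirs and perhaps find one optimized in ASM.

To the "god's gift to the programming world" people: put your hubris aside for a moment and look at some facts.

There is nothing wrong with "strncpy()", or with any of the similar functions that have the same side effects and issues, such as "_snprintf()".

I say: "strncpy() is not horrible", but rather "horrible programmers use it horribly".

What is "horrible" is not knowing the rules. Furthermore, given the security (such as buffer overruns) and program stability implications of this whole subject, there would have been no need for, say, Microsoft to add the "Safe String Functions" to its CRT library if the rules were simply followed.

The main ones:

"sizeof()" returns the length of a static string with the terminator.
"strlen()" returns the length of a string without the terminator.
Most if not all "n" functions just clamp to 'n' without adding a terminator.
There is an implicit ambiguity about what "buffer size" means in functions that take an input buffer size, i.e. the "(char *pszBuffer, int iBufferSize)" kind. It is safer to assume the worst: pass a size one less than the actual buffer size and add a terminator at the end to be sure.
For string inputs, buffers, etc., set and use a reasonable size limit based on the expected average and maximum, to hopefully avoid input truncation and to eliminate buffer overruns, period.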
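The first two rules can be checked directly. This is only a tiny sketch with made-up names; note that `sizeof` gives this result for an array or a string literal, not for a pointer to a string.

```c
#include <string.h>

static const char greeting[] = "hello";

/* Size of the array in bytes, including the terminator. */
size_t greeting_size(void)
{
    return sizeof greeting;
}

/* Length of the string in bytes, excluding the terminator. */
size_t greeting_length(void)
{
    return strlen(greeting);
}
```

Here `greeting_size()` yields 6 and `greeting_length()` yields 5, which is the off-by-one that the remaining rules are designed to guard against.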

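This is how I personally handle such things, along with other rules that simply have to be known and practiced.

A handy macro for the size of a static string:

// Size of a string without terminator
#define SIZESTR(x) (sizeof(x) - 1)

When declaring local/stack string buffers:

A) The size is limited to, for example, 1023 + 1 for the terminator, to allow for strings of up to 1023 chars in length.

B) I initialize the string to zero length and also terminate it at the very end, to cover a possible 'n' truncation.

char szBuffer[1024]; szBuffer[0] = szBuffer[SIZESTR(szBuffer)] = 0;

Alternatively one could write: char szBuffer[1024] = {0}; of course, but then there is some performance cost from the compiler-generated "memset()"-like call that zeroes the whole buffer. It does make things cleaner for debugging, though, and I prefer this style for static (vs. local/stack) string buffers.

Now a "strncpy()" that follows the rules: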

char szBuffer[1024]; szBuffer[0] = szBuffer[SIZESTR(szBuffer)] = 0; 
strncpy(szBuffer, pszSomeInput, SIZESTR(szBuffer));
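To see the effect of this pattern on an input that is too long, here is a small sketch using a deliberately tiny 16-byte buffer. The wrapper `clamped_length` is hypothetical and exists only so the result can be inspected.

```c
#include <string.h>

/* Size of a string without terminator */
#define SIZESTR(x) (sizeof(x) - 1)

/* Hypothetical wrapper: copies the input into a 16-byte local buffer
 * using the pattern above and returns the length of the result. */
size_t clamped_length(const char *pszSomeInput)
{
    char szBuffer[16];

    /* Empty string at the front, terminator in the last byte. */
    szBuffer[0] = szBuffer[SIZESTR(szBuffer)] = 0;

    /* strncpy() may fill bytes 0..14 but never touches byte 15. */
    strncpy(szBuffer, pszSomeInput, SIZESTR(szBuffer));

    return strlen(szBuffer);
}
```

A short input keeps its own length, and a longer one is clamped to 15 bytes and still terminated. For UTF-8 input the clamp is applied at a byte position, so the last character may still be cut in half, which is the problem the question is about.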

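There are other "rules" and issues of course, but these are the main ones that come to mind. You just have to know how the library functions work and use safe practices like these.

Finally, in my project I use ICU anyway, so I decided to go with it and use the macros in "utf8.h" to make my own "strncpy()".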
