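Is it possible to "force" UTF-8 in a C program?

Time: 2016-03-23 16:45:01

Tags: c encoding utf-8 character-encoding

Usually, when I want my program to use UTF-8 encoding, I write setlocale (LC_ALL, "");. But today I found out that this only sets the locale to the environment's default locale, and I cannot know whether the environment uses UTF-8 by default.

Is there any way to force the character encoding to be UTF-8? Also, is there any way to check whether my program is using UTF-8?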

3 answers:

Answer 0: (score: 2)

Try:

setlocale(LC_ALL, "en_US.UTF-8");

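You can run locale -a in the terminal to get a full list of the locales supported by your system ("en_US.UTF-8" should be supported by most/all UTF-8 supporting systems).

EDIT 1 (alternate spelling)

In the comments, Lee points out that some systems have an alternative spelling, "en_US.utf8" (which surprised me, but we learn new stuff every day).

Since setlocale returns NULL when it fails, you can chain these calls: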

if(!setlocale(LC_ALL, "en_US.UTF-8") && !setlocale(LC_ALL, "en_US.utf8"))
   printf("failed to set locale to UTF-8");

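EDIT 2 (finding out if we are using UTF-8)

To find out whether the locale is set to UTF-8 (after attempting to set it), you can either check the returned value (NULL means the call failed) or check the locale in use.

Option 1: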

char * result;
if((result = setlocale (LC_ALL, "en_US.UTF-8")) == NULL)
   printf("failed to set locale to UTF-8");

Option 2:

setlocale (LC_ALL, "en_US.UTF-8"); // set
char * result = setlocale (LC_ALL, NULL); // review
if(!strstr(result, "UTF-8"))
   printf("failed to set locale to UTF-8");
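Searching the locale name for "UTF-8" misses the alternate "utf8" spelling mentioned above. On POSIX systems, another way to check is to ask the C library for the codeset of the current locale. The following is only a minimal sketch, assuming a POSIX system that provides nl_langinfo() in <langinfo.h>; codeset_is_utf8() and locale_is_utf8() are hypothetical helper names, not part of any library. The comparison ignores case and hyphens, so both spellings match.

```c
#include <ctype.h>
#include <stddef.h>
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper: returns 1 if name spells UTF-8, ignoring case
 * and hyphens ("UTF-8", "utf8", "Utf-8"), and 0 otherwise. */
int codeset_is_utf8(const char *name)
{
    static const char want[] = "utf8";
    size_t i = 0;

    if (!name)
        return 0;

    for (; *name; name++) {
        if (*name == '-')
            continue;
        if (want[i] == '\0' || tolower((unsigned char)*name) != want[i])
            return 0;
        i++;
    }

    /* Match only if all of "utf8" was seen. */
    return want[i] == '\0';
}

/* Hypothetical helper: asks the C library which codeset the current
 * LC_CTYPE locale uses. Call setlocale() before calling this. */
int locale_is_utf8(void)
{
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

After a setlocale() call, locale_is_utf8() reports whether the locale now in effect uses UTF-8, whichever spelling the system uses for the locale name.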

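Answer 1: (score: 1)

It is possible, but it is the completely wrong thing to do.

First of all, the current locale is for the user to decide. It covers not just the character set, but also the language, date and time formats, and so on. Your program has absolutely no "right" to mess with it.

If you cannot localize your program, just tell the user the environmental requirements your program has, and let them worry about it.

Really, you should not rely on UTF-8 being the current encoding, but use wide character support instead, including functions like wctype(), mbstowcs(), and so on. POSIXy systems also provide the iconv_open() and iconv() family of functions in their C libraries to convert between encodings (which should always include conversion to and from wchar_t); on Windows, you need a separate version of the libiconv library. This is how, for example, the GCC compiler handles different character sets. (Internally, it uses Unicode/UTF-8, but if you ask it to, it can do the necessary conversions to work with other character sets.)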
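As an illustration of the wide character approach, here is a minimal sketch of converting a narrow (multibyte) string in the user's current encoding to a wide string. It uses the standard mbsrtowcs(), the restartable sibling of mbstowcs(); the helper name to_wide() is hypothetical, and the sketch assumes setlocale(LC_ALL, "") has already been called so the conversion follows the user's locale.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string in the current
 * locale's encoding to a newly allocated wide string.
 * Returns NULL if the input is not valid in the current locale,
 * or if memory runs out. The caller frees the result. */
wchar_t *to_wide(const char *narrow)
{
    mbstate_t   state;
    const char *src;
    wchar_t    *wide;
    size_t      len;

    if (!narrow)
        return NULL;

    /* First pass: count the wide characters needed. */
    memset(&state, 0, sizeof state);
    src = narrow;
    len = mbsrtowcs(NULL, &src, 0, &state);
    if (len == (size_t)-1)
        return NULL;

    wide = malloc((len + 1) * sizeof wide[0]);
    if (!wide)
        return NULL;

    /* Second pass: do the actual conversion, including the L'\0'. */
    memset(&state, 0, sizeof state);
    src = narrow;
    if (mbsrtowcs(wide, &src, len + 1, &state) == (size_t)-1) {
        free(wide);
        return NULL;
    }

    return wide;
}
```

Because the conversion depends only on the current locale, the same code works whether the user's environment is UTF-8 or something else.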

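I am personally a strong proponent of using UTF-8 everywhere, but overriding the user locale in a program is horrific. Abhorrent. Distasteful; like a desktop applet changing the display resolution because the programmer is particularly fond of a certain one.

I would be happy to write some example code to show how to correctly solve any character-set-sensitive situation, but there are so many, I do not know where to start.

If the OP amends their question to state exactly what problem overriding the character set is supposed to solve, I am willing to show how to use the aforementioned utilities and POSIX facilities (or equivalent freely available libraries on Windows) to solve it correctly.

If this seems harsh to someone, it is, but only because taking the quick and easy route here (overriding the user's locale setting) is ... wrong, on purely technical grounds. Even doing nothing is better, and actually quite acceptable, as long as you document that your application only handles UTF-8 input/output.

Example 1. Localized Happy New Year!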

#include <stdlib.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* We wish to use the user's current locale. */
    setlocale(LC_ALL, "");

    /* We intend to use wide functions on standard output. */
    fwide(stdout, 1);

    /* For Windows compatibility, print out a Byte Order Mark.
     * If you save the output to a file, this helps tell Windows
     * applications that the file is Unicode.
     * Other systems don't need it nor use it.
    */
    fputwc(L'\uFEFF', stdout);

    wprintf(L"Happy New Year!\n");
    wprintf(L"С новым годом!\n");
    wprintf(L"新年好!\n");
    wprintf(L"賀正!\n");
    wprintf(L"¡Feliz año nuevo!\n");
    wprintf(L"Hyvää uutta vuotta!\n");

    return EXIT_SUCCESS;
}

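Note that wprintf() takes a wide string (wide string constants are of the form L"", and wide character constants L'', as opposed to their normal/narrow counterparts "" and ''). The formats are still the same; %s prints a normal/narrow string, and %ls prints a wide string.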
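To make the %s versus %ls distinction concrete, here is a small sketch using the standard swprintf(), which formats into a wide buffer instead of printing. The helper name format_mixed() is hypothetical and exists only for this illustration.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: format a narrow and a wide string into one
 * wide buffer. The format string itself is wide, but %s still takes
 * a narrow (char *) string, and %ls a wide (wchar_t *) string.
 * Returns the number of wide characters written (not counting the
 * terminating L'\0'), or a negative value on error. */
int format_mixed(wchar_t *buffer, size_t size,
                 const char *narrow, const wchar_t *wide)
{
    return swprintf(buffer, size, L"%s / %ls", narrow, wide);
}
```

Calling format_mixed(buffer, 32, "abc", L"def") fills buffer with the wide string L"abc / def". The narrow argument is converted using the current locale, which is why setlocale() should be called first when it contains non-ASCII characters.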

Example 2. Reads input lines from standard input, and optionally saves them to a file. The file name is supplied on the command line.

#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wctype.h>
#include <wchar.h>
#include <errno.h>
#include <stdio.h>

typedef enum {
    TRIM_LEFT     = 1,  /* Remove leading whitespace and control characters */
    TRIM_RIGHT    = 2,  /* Remove trailing whitespace and control characters */
    TRIM_NEWLINE  = 4,  /* Remove newline at end of line */
    TRIM          = 7,  /* Remove leading and trailing whitespace and control characters */
    OMIT_NUL      = 8,  /* Skip NUL characters (embedded binary zeros, L'\0') */
    OMIT_CONTROLS = 16, /* Skip control characters */
    CLEANUP       = 31, /* All of the above. */
    COMBINE_LWS   = 32, /* Combine all whitespace into a single space */
} trim_opts;

/* Read an unlimited-length line from a wide input stream.
 *
 * This function takes a pointer to a wide string pointer,
 * pointer to the number of wide characters dynamically allocated for it,
 * the stream to read from, and a set of options on how to treat the line.
 *
 * If an error occurs, this will return 0 with errno set to nonzero error number.
 * Use strerror(errno) to obtain the error description (as a narrow string).
 *
 * If there is no more data to read from the stream,
 * this will return 0 with errno 0, and feof(stream) will return true.
 *
 * If an empty line is read,
 * this will return 0 with errno 0, but feof(stream) will return false.
 *
 * Typically, you initialize variables like
 *     wchar_t *line = NULL;
 *     size_t   size = 0;
 * before calling this function, so that subsequent calls reuse the same,
 * dynamically allocated buffer for the line, and it is automatically grown
 * if necessary. There are no built-in limits to line lengths this way.
*/
size_t getwline(wchar_t **const lineptr,
                size_t   *const sizeptr,
                FILE     *const in,
                trim_opts const trimming)
{
    wchar_t *line;
    size_t   size;
    size_t   used = 0;
    wint_t   wc;
    fpos_t   startpos;
    int      seekable;

    if (lineptr == NULL || sizeptr == NULL || in == NULL) {
        errno = EINVAL;
        return 0;
    }

    if (*lineptr != NULL) {
        line = *lineptr;
        size = *sizeptr;
    } else {
        line = NULL;
        size = 0;
        *sizeptr = 0;
    }

    /* In error cases, we can try and get back to this position
     * in the input stream, as we cannot really return the data
     * read thus far. However, some streams like pipes are not seekable,
     * so in those cases we should not even try.
     * Use (seekable) as a flag to remember if we should try.
    */
    if (fgetpos(in, &startpos) == 0)
        seekable = 1;
    else
        seekable = 0;

    while (1) {

        /* When we read a wide character from a wide stream,
         * fgetwc() will return WEOF with errno set if an error occurs.
         * However, fgetwc() will return WEOF with errno *unchanged*
         * if there is no more input in the stream.
         * To detect which of the two happened, we need to clear errno
         * first.
        */
        errno = 0;

        wc = fgetwc(in);
        if (wc == WEOF) {
            if (errno) {
                const int saved_errno = errno;
                if (seekable)
                    fsetpos(in, &startpos);
                errno = saved_errno;
                return 0;
            }
            if (ferror(in)) {
                if (seekable)
                    fsetpos(in, &startpos);
                errno = EIO;
                return 0;
            }
            break;
        }

        /* Dynamically grow line buffer if necessary.
         * We need room for the current wide character,
         * plus at least the end-of-string mark, L'\0'.
        */
        if (used + 2 > size) {
            /* Size policy. This can be anything you see fit,
             * as long as it yields size >= used + 2.
             *
             * This one increments size to next multiple of
             * 1024 (minus 16). It works well in practice,
             * but do not think of it as the "best" way.
             * It is just a robust choice.
            */
            size = (used | 1023) + 1009;
            line = realloc(line, size * sizeof line[0]);
            if (!line) {
                /* Memory allocation failed. */
                if (seekable)
                    fsetpos(in, &startpos);
                errno = ENOMEM;
                return 0;
            }
            *lineptr = line;
            *sizeptr = size;
        }

        /* Append character to buffer. */
        if (!trimming)
            line[used++] = wc;
        else {
            /* Check if we have reasons to NOT add the character to buffer. */
            do {

                /* Omit NUL if asked to. */
                if (trimming & OMIT_NUL)
                    if (wc == L'\0')
                        break;

                /* Omit controls if asked to. */
                if (trimming & OMIT_CONTROLS)
                    if (iswcntrl(wc))
                        break;

                /* If we are at start of line, and we are left-trimming,
                 * only graphs (printable non-whitespace characters) are added. */
                if ((trimming & TRIM_LEFT) && used == 0)
                    if (wc == L'\0' || !iswgraph(wc))
                        break;

                /* Combine whitespaces if asked to. */
                if (trimming & COMBINE_LWS)
                    if (iswspace(wc)) {
                        if (used > 0 && line[used-1] == L' ')
                            break;
                        else
                            wc = L' ';
                    }

                /* Okay, add the character to buffer. */
                line[used++] = wc;

            } while (0);
        }

        /* End of the line? */
        if (wc == L'\n')
            break;
    }

    /* The above loop will only end (break out)
     * if end of line or end of input was found,
     * and no error occurred.
    */

    /* Trim right if asked to. */
    if (trimming & TRIM_RIGHT)
        while (used > 0 && iswspace(line[used-1]))
            --used;
    else
    if (trimming & TRIM_NEWLINE)
        while (used > 0 && (line[used-1] == L'\r' || line[used-1] == L'\n'))
            --used;

    /* Ensure we have room for end-of-string L'\0'. */
    if (used >= size) {
        size = used + 1;
        line = realloc(line, size * sizeof line[0]);
        if (!line) {
            if (seekable)
                fsetpos(in, &startpos);
            errno = ENOMEM;
            return 0;
        }
        *lineptr = line;
        *sizeptr = size;
    }

    /* Add end of string mark. */
    line[used] = L'\0';

    /* Successful return. */
    errno = 0;
    return used;
}

/* Counts the number of wide characters in 'alpha' class.
*/
size_t count_letters(const wchar_t *ws)
{
    size_t count = 0;
    if (ws)
        while (*ws != L'\0')
            if (iswalpha(*(ws++)))
                count++;
    return count;
}

int main(int argc, char *argv[])
{
    FILE    *out;

    wchar_t *line = NULL;
    size_t   size = 0;
    size_t   len;

    setlocale(LC_ALL, "");

    /* Standard input and output should use wide characters. */
    fwide(stdin, 1);
    fwide(stdout, 1);

    /* Check if the user asked for help. */
    if (argc < 2 || argc > 3 || strcmp(argv[1], "-h") == 0 || strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "/?") == 0) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help | /? ]\n", argv[0]);
        fprintf(stderr, "       %s FILENAME [ PROMPT ]\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "The program will read input lines until an only '.' is supplied.\n");
        fprintf(stderr, "If you do not want to save the output to a file,\n");
        fprintf(stderr, "use '-' as the FILENAME.\n");
        fprintf(stderr, "\n");
        return EXIT_SUCCESS;
    }

    /* Open file for output, unless it is "-". */
    if (strcmp(argv[1], "-") == 0)
        out = NULL; /* No output to file */
    else {
        out = fopen(argv[1], "w");
        if (out == NULL) {
            fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
            return EXIT_FAILURE;
        }

        /* The output file is used with wide strings. */
        fwide(out, 1);
    }

    while (1) {

        /* Prompt? Note: our prompt string is narrow, but stdout is wide. */
        if (argc > 2) {
            wprintf(L"%s\n", argv[2]);
            fflush(stdout);
        }

        len = getwline(&line, &size, stdin, CLEANUP);
        if (len == 0) {
            if (errno) {
                fprintf(stderr, "Error reading standard input: %s.\n", strerror(errno));
                break;
            }
            if (feof(stdin))
                break;
        }

        /* The user does not wish to supply more lines? */
        if (wcscmp(line, L".") == 0)
            break;

        /* Print the line to the file. */
        if (out != NULL) {
            fputws(line, out);
            fputwc(L'\n', out);
        }

        /* Tell the user what we read. */
        wprintf(L"Received %lu wide characters, %lu of which were letterlike.\n",
                (unsigned long)len, (unsigned long)count_letters(line));
        fflush(stdout);
    }

    /* The line buffer is no longer needed, so we can discard it.
     * Note that free(NULL) is safe, so we do not need to check.
    */
    free(line);

    /* I personally also like to reset the variables.
     * It helps with debugging, and to avoid reuse-after-free() errors. */
    line = NULL;
    size = 0;

    return EXIT_SUCCESS;
}

The getwline() function above is pretty much at the most complicated end of the functions you might need when dealing with localized wide character support. It lets you read localized input lines without length restrictions, and optionally trims and cleans up the returned string (removing control codes and embedded binary zeros). It also works with both LF and CR-LF (\n and \r\n) newline encodings.
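The buffer growth policy used by getwline(), size = (used | 1023) + 1009, is easier to follow with the arithmetic spelled out. The sketch below isolates it in a hypothetical helper named next_size(); the name is made up for this illustration and does not appear in the example program.

```c
#include <stddef.h>

/* Hypothetical helper reproducing the size policy: given the number
 * of wide characters already used, return the new buffer size. */
size_t next_size(size_t used)
{
    /* (used | 1023) sets the ten lowest bits, which yields one less
     * than a multiple of 1024. Adding 1009 then lands exactly 16
     * below the following multiple of 1024. The result is always
     * at least used + 1009, so the requirement size >= used + 2
     * holds. */
    return (used | 1023) + 1009;
}
```

For example, next_size(0) and next_size(1023) both give 2032 (2048 minus 16), and next_size(1024) gives 3056 (3072 minus 16), so the buffer grows by 1024 wide characters at a time.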

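Answer 2: (score: 1)

This is not an answer, but a third, quite complex example of how to use wide character I/O. It was too long to add to my actual answer to this question.

This example shows how to read and process CSV files (RFC-4180 format, optionally with limited backslash escape support) using wide strings.

The following code is CC0/public domain, so you are free to use it in any way you like, even include it in your own proprietary projects, but if it breaks anything, you get to keep all the bits and not complain to me. (I will be happy to include any bug fixes if you find and report them in a comment below, though.)

The logic of the code is robust, however. In particular, it supports universal newlines, that is, all four common newline types: Unix-like LF (\n), old CR LF (\r\n), old Mac CR (\r), and the occasionally encountered weird LF CR (\n\r). There are no built-in limits on the length of a field, the number of fields in a record, or the number of records in a file. It works very nicely if you need to convert CSV or process CSV input as a stream (field by field or record by record), without having to keep more than one field in memory at a time. If you want to construct structures to describe the records and fields in memory, you will need to add some scaffolding code for that.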
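The universal newline rule can be illustrated in isolation. This is only a sketch of the idea on an in-memory wide string, not the stream-based code from the example; count_newlines() is a hypothetical helper name. A newline character followed immediately by the other newline character is treated as one two-character newline.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: count line breaks in a wide string, treating
 * LF, CR, CR LF, and LF CR each as a single newline. */
size_t count_newlines(const wchar_t *ws)
{
    size_t count = 0;

    if (!ws)
        return 0;

    while (*ws != L'\0') {
        const wchar_t wc = *(ws++);

        if (wc == L'\n' || wc == L'\r') {
            count++;

            /* A different newline character right after this one
             * belongs to the same two-character newline. */
            if ((*ws == L'\n' || *ws == L'\r') && *ws != wc)
                ws++;
        }
    }

    return count;
}
```

With this rule, two identical characters in a row (such as LF LF) still count as two newlines, so empty lines are preserved.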

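Because of the universal newline support, when reading input interactively, this program might require two consecutive end-of-inputs (Ctrl+Z in Windows and MS-DOS, Ctrl+D everywhere else), as the first one is usually "consumed" by the csv_next_field() or csv_skip_field() function, and the csv_next_record() function needs to re-read it to actually detect it. However, you do not normally ask the user to input CSV data interactively, so this should be an acceptable quirk.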

#include <stdlib.h>
#include <locale.h>
#include <string.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>
#include <errno.h>

/* RFC-4180 -format CSV file processing using wide input streams.
 *
 * #define BACKSLASH_ESCAPES if you additionally wish to have
 * \\, \a, \b, \t, \n, \v, \f, \r, \", and \, de-escaped to their
 * C string equivalents when reading CSV fields.
*/

typedef enum {
    CSV_OK = 0,
    CSV_END = 1,
    CSV_INVALID_PARAMETERS = -1,
    CSV_FORMAT_ERROR = -2,
    CSV_CHARSET_ERROR = -3,
    CSV_READ_ERROR = -4,
    CSV_OUT_OF_MEMORY = -5,
} csv_status;

const char *csv_error(const csv_status code)
{
    switch (code) {
    case CSV_OK:                 return "No error";
    case CSV_END:                return "At end";
    case CSV_INVALID_PARAMETERS: return "Invalid parameters";
    case CSV_FORMAT_ERROR:       return "Bad CSV format";
    case CSV_CHARSET_ERROR:      return "Illegal character in CSV file (incorrect locale?)";
    case CSV_READ_ERROR:         return "Read error";
    case CSV_OUT_OF_MEMORY:      return "Out of memory";
    default:                     return "Unknown csv_status code"; 
    }
}

/* Start the next record. Automatically skips any remaining fields in current record.
 * Returns CSV_OK if successful, CSV_END if no more records, or a negative CSV_ error code. */
csv_status csv_next_record (FILE *const in);

/* Skip the next field. Returns CSV_OK if successful, CSV_END if no more fields in current record,
 * or a negative CSV_ error code. */
csv_status csv_skip_field  (FILE *const in);

/* Read the next field. Returns CSV_OK if successful, CSV_END if no more fields in current record,
 * or a negative CSV_ error code.
 * If this returns CSV_OK, then *dataptr is a dynamically allocated wide string to the field
 * contents, space allocated for *sizeptr wide characters; and if lengthptr is not NULL, then
 * *lengthptr is the number of wide characters in said wide string. */
csv_status csv_next_field  (FILE *const in, wchar_t **const dataptr,
                                            size_t   *const sizeptr,
                                            size_t   *const lengthptr);

static csv_status internal_skip_quoted(FILE *const in)
{
    while (1) {
        wint_t  wc;

        errno = 0;
        wc = fgetwc(in);

        if (wc == WEOF) {
            if (errno == EILSEQ)
                return CSV_CHARSET_ERROR;
            if (errno)
                return CSV_READ_ERROR;
            if (ferror(in)) {
                errno = EIO;
                return CSV_READ_ERROR;
            }
            errno = 0;
            return CSV_FORMAT_ERROR;
        }

        if (wc == L'"') {
            errno = 0;
            wc = fgetwc(in);            

            if (wc == L'"')
                continue;

            while (wc != WEOF && wc != L'\n' && wc != L'\r' && iswspace(wc)) {
                errno = 0;
                wc = fgetwc(in);
            }

            if (wc == WEOF) {
                if (errno == EILSEQ)
                    return CSV_CHARSET_ERROR;
                if (errno)
                    return CSV_READ_ERROR;
                if (ferror(in)) {
                    errno = EIO;
                    return CSV_READ_ERROR;
                }
                errno = 0;
                return CSV_END;
            }

            if (wc == L',') {
                errno = 0;
                return CSV_OK;
            }

            if (wc == L'\n' || wc == L'\r') {
                ungetwc(wc, in);
                errno = 0;
                return CSV_END;
            }

            ungetwc(wc, in);
            errno = 0;
            return CSV_FORMAT_ERROR;
        }

#ifdef BACKSLASH_ESCAPES
        if (wc == L'\\') {
            errno = 0;
            wc = fgetwc(in);

            if (wc == L'"')
                continue;

            if (wc == WEOF) {
                if (errno == EILSEQ)
                    return CSV_CHARSET_ERROR;
                if (errno)
                    return CSV_READ_ERROR;
                if (ferror(in)) {
                    errno = EIO;
                    return CSV_READ_ERROR;
                }
                errno = 0;
                return CSV_END;
            }
        }
#endif
    }
}

static csv_status internal_skip_unquoted(FILE *const in, wint_t wc)
{
    while (1) {

        if (wc == WEOF) {
            if (errno == EILSEQ)
                return CSV_CHARSET_ERROR;
            if (errno)
                return CSV_READ_ERROR;
            if (ferror(in)) {
                errno = EIO;
                return CSV_READ_ERROR;
            }
            errno = 0;
            return CSV_END;
        }

        if (wc == L',') {
            errno = 0;
            return CSV_OK;
        }

        if (wc == L'\n' || wc == L'\r') {
            ungetwc(wc, in);
            errno = 0;
            return CSV_END;
        }

#ifdef BACKSLASH_ESCAPES
        if (wc == L'\\') {
            errno = 0;
            wc = fgetwc(in);
            if (wc == WEOF) {
                if (errno == EILSEQ)
                    return CSV_CHARSET_ERROR;
                if (errno)
                    return CSV_READ_ERROR;
                if (ferror(in)) {
                    errno = EIO;
                    return CSV_READ_ERROR;
                }
                errno = 0;
                return CSV_END;
            }
        }
#endif

        errno = 0;
        wc = fgetwc(in);
    }
}

csv_status csv_next_record(FILE *const in)
{
    while (1) {
        wint_t      wc;
        csv_status  status;

        do {
            errno = 0;
            wc = fgetwc(in);
        } while (wc != WEOF && wc != L'\n' && wc != L'\r' && iswspace(wc));

        if (wc == WEOF) {
            if (errno == EILSEQ)
                return CSV_CHARSET_ERROR;
            if (errno)
                return CSV_READ_ERROR;
            if (ferror(in)) {
                errno = EIO;
                return CSV_READ_ERROR;
            }
            errno = 0;
            return CSV_END;
        }

        if (wc == L'\n' || wc == L'\r') {
            wint_t next_wc;

            errno = 0;
            next_wc = fgetwc(in);

            if (next_wc == WEOF) {
                if (errno == EILSEQ)
                    return CSV_CHARSET_ERROR;
                if (errno)
                    return CSV_READ_ERROR;
                if (ferror(in)) {
                    errno = EIO;
                    return CSV_READ_ERROR;
                }
                errno = 0;
                return CSV_END;
            }

            if ((wc == L'\n' && next_wc == L'\r') ||
                (wc == L'\r' && next_wc == L'\n')) {
                errno = 0;
                return CSV_OK;
            }

            ungetwc(next_wc, in);
            errno = 0;
            return CSV_OK;
        }

        if (wc == L'"')
            status = internal_skip_quoted(in);
        else
            status = internal_skip_unquoted(in, wc);

        if (status < 0)
            return status;
    }
}

csv_status csv_skip_field(FILE *const in)
{
    wint_t  wc;

    if (!in) {
        errno = EINVAL;
        return CSV_INVALID_PARAMETERS;
    } else
    if (ferror(in)) {
        errno = EIO;
        return CSV_READ_ERROR;
    }

    /* Skip leading whitespace. */
    do {
        errno = 0;
        wc = fgetwc(in);
    } while (wc != WEOF && wc != L'\n' && wc != L'\r' && iswspace(wc));

    if (wc == L'"')
        return internal_skip_quoted(in);
    else
        return internal_skip_unquoted(in, wc);

}        

csv_status csv_next_field(FILE *const in, wchar_t **const dataptr,
                                          size_t   *const sizeptr,
                                          size_t   *const lengthptr)
{
    wchar_t *data;
    size_t   size;
    size_t   used = 0; /* length */
    wint_t   wc;

    if (lengthptr)
        *lengthptr = 0;

    if (!in || !dataptr || !sizeptr) {
        errno = EINVAL;
        return CSV_INVALID_PARAMETERS;
    } else
    if (ferror(in)) {
        errno = EIO;
        return CSV_READ_ERROR;
    }

    if (*dataptr) {
        data = *dataptr;
        size = *sizeptr;
    } else {
        data = NULL;
        size = 0;
        *sizeptr = 0;
    }

    /* Skip leading whitespace. */
    do {
        errno = 0;
        wc = fgetwc(in);
    } while (wc != WEOF && wc != L'\n' && wc != L'\r' && iswspace(wc));

    if (wc == WEOF) {
        if (errno == EILSEQ)
            return CSV_CHARSET_ERROR;
        if (errno)
            return CSV_READ_ERROR;
        if (ferror(in)) {
            errno = EIO;
            return CSV_READ_ERROR;
        }
        errno = 0;
        return CSV_END;
    }

    if (wc == L'\n' || wc == L'\r') {
        ungetwc(wc, in);
        errno = 0;
        return CSV_END;
    }

    if (wc == L'"')
        while (1) {

            errno = 0;
            wc = getwc(in);

            if (wc == WEOF) {
                if (errno == EILSEQ)
                    return CSV_CHARSET_ERROR;
                if (errno)
                    return CSV_READ_ERROR;
                if (ferror(in)) {
                    errno = EIO;
                    return CSV_READ_ERROR;
                }
                errno = 0;
                return CSV_FORMAT_ERROR;

            } else
            if (wc == L'"') {
                errno = 0;
                wc = getwc(in);

                if (wc != L'"') {
                    /* Not an escaped doublequote. */

                    while (wc != WEOF && wc != L'\n' && wc != L'\r' && iswspace(wc)) {
                        errno = 0;
                        wc = getwc(in);
                    }

                    if (wc == WEOF) {
                        if (errno == EILSEQ)
                            return CSV_CHARSET_ERROR;
                        if (errno)
                            return CSV_READ_ERROR;
                        if (ferror(in)) {
                            errno = EIO;
                            return CSV_READ_ERROR;
                        }
                    } else
                    if (wc == L'\n' || wc == L'\r') {
                        ungetwc(wc, in);
                    } else
                    if (wc != L',') {
                        errno = 0;
                        return CSV_FORMAT_ERROR;
                    }
                    break;
                }

#ifdef BACKSLASH_ESCAPES
            } else
            if (wc == L'\\') {
                errno = 0;
                wc = getwc(in);

                if (wc == L'\0')
                    continue;
                else
                if (wc == WEOF) {
                    if (errno == EILSEQ)
                        return CSV_CHARSET_ERROR;
                    if (errno)
                        return CSV_READ_ERROR;
                    if (ferror(in)) {
                        errno = EIO;
                        return CSV_READ_ERROR;
                    }
                    break;
                } else
                    switch (wc) {
                    case L'a':  wc = L'\a'; break;
                    case L'b':  wc = L'\b'; break;
                    case L't':  wc = L'\t'; break;
                    case L'n':  wc = L'\n'; break;
                    case L'v':  wc = L'\v'; break;
                    case L'f':  wc = L'\f'; break;
                    case L'r':  wc = L'\r'; break;
                    case L'\\': wc = L'\\'; break;
                    case L'"':  wc = L'"';  break;
                    case L',':  wc = L',';  break;
                    default:
                        ungetwc(wc, in);
                        wc = L'\\';
                    }
#endif
            }

            if (used + 2 > size) {
                /* Allocation policy.
                 * Anything that yields size >= used + 2 is acceptable.
                 * This one allocates in roughly 1024 byte chunks,
                 * and is known to be robust (but not optimal) in practice. */
                size = (used | 1023) + 1009;
                data = realloc(data, size * sizeof data[0]);
                if (!data) {
                    errno = ENOMEM;
                    return CSV_OUT_OF_MEMORY;
                }
                *dataptr = data;
                *sizeptr = size;
            }

            data[used++] = wc;
        }
    else
        while (1) {

            if (wc == L',')
                break;

            if (wc == L'\n' || wc == L'\r') {
                ungetwc(wc, in);
                break;
            }

#ifdef BACKSLASH_ESCAPES
            if (wc == L'\\') {
                errno = 0;
                wc = fgetwc(in);
                if (wc == WEOF) {
                    if (errno == EILSEQ)
                        return CSV_CHARSET_ERROR;
                    if (errno)
                        return CSV_READ_ERROR;
                    if (ferror(in)) {
                        errno = EIO;
                        return CSV_READ_ERROR;
                    }
                    wc = L'\\';
                } else
                    switch (wc) {
                    case L'a':  wc = L'\a'; break;
                    case L'b':  wc = L'\b'; break;
                    case L't':  wc = L'\t'; break;
                    case L'n':  wc = L'\n'; break;
                    case L'v':  wc = L'\v'; break;
                    case L'f':  wc = L'\f'; break;
                    case L'r':  wc = L'\r'; break;
                    case L'"':  wc = L'"';  break;
                    case L',':  wc = L',';  break;
                    case L'\\': wc = L'\\'; break;
                    default:
                        ungetwc(wc, in);
                        wc = L'\\';
                    }
            }
#endif

            if (used + 2 > size) {
                /* Allocation policy.
                 * Anything that yields size >= used + 2 is acceptable.
                 * This one allocates in roughly 1024 byte chunks,
                 * and is known to be robust (but not optimal) in practice. */
                size = (used | 1023) + 1009;
                data = realloc(data, size * sizeof data[0]);
                if (!data) {
                    errno = ENOMEM;
                    return CSV_OUT_OF_MEMORY;
                }
                *dataptr = data;
                *sizeptr = size;
            }

            data[used++] = wc;

            errno = 0;
            wc = getwc(in);

            if (wc == WEOF) {
                if (errno == EILSEQ)
                    return CSV_CHARSET_ERROR;
                if (errno)
                    return CSV_READ_ERROR;
                if (ferror(in)) {
                    errno = EIO;
                    return CSV_READ_ERROR;
                }
                break;
            }
        }

    /* Ensure there is room for the end-of-string mark. */
    if (used >= size) {
        size = used + 1;
        data = realloc(data, size * sizeof data[0]);
        if (!data) {
            errno = ENOMEM;
            return CSV_OUT_OF_MEMORY;
        }
        *dataptr = data;
        *sizeptr = size;
    }

    data[used] = L'\0';

    if (lengthptr)
        *lengthptr = used;

    errno = 0;
    return CSV_OK;
}

/* Helper function: print a wide string as if in quotes, but backslash-escape special characters.
*/
static void wquoted(FILE *const out, const wchar_t *ws, const size_t len)
{
    if (out) {
        size_t i;

        for (i = 0; i < len; i++)
            if (ws[i] == L'\0')
                fputws(L"\\0", out);
            else
            if (ws[i] == L'\a')
                fputws(L"\\a", out);
            else
            if (ws[i] == L'\b')
                fputws(L"\\b", out);
            else
            if (ws[i] == L'\t')
                fputws(L"\\t", out);
            else
            if (ws[i] == L'\n')
                fputws(L"\\n", out);
            else
            if (ws[i] == L'\v')
                fputws(L"\\v", out);
            else
            if (ws[i] == L'\f')
                fputws(L"\\f", out);
            else
            if (ws[i] == L'\r')
                fputws(L"\\r", out);
            else
            if (ws[i] == L'"')
                fputws(L"\\\"", out);
            else
            if (ws[i] == L'\\')
                fputws(L"\\\\", out);
            else
            if (iswprint(ws[i])) 
                fputwc(ws[i], out);
            else
            if (ws[i] < 65535)
                fwprintf(out, L"\\x%04x", (unsigned int)ws[i]);
            else
                fwprintf(out, L"\\x%08lx", (unsigned long)ws[i]);
    }
}


static int show_csv(FILE *const in, const char *const filename)
{
    wchar_t        *field_contents = NULL;
    size_t          field_allocated = 0;
    size_t          field_length = 0;
    unsigned long   record = 0UL;
    unsigned long   field;
    csv_status      status;

    while (1) {

        /* First field in this record. */
        field = 0UL;
        record++;

        while (1) {

            status = csv_next_field(in, &field_contents, &field_allocated, &field_length);

            if (status == CSV_END)
                break;

            if (status < 0) {
                fprintf(stderr, "%s: %s.\n", filename, csv_error(status));
                free(field_contents);
                return -1;
            }

            field++;

            wprintf(L"Record %lu, field %lu is \"", record, field);
            wquoted(stdout, field_contents, field_length);
            wprintf(L"\", %lu characters.\n", (unsigned long)field_length);
        }

        status = csv_next_record(in);

        if (status == CSV_END) {
            free(field_contents);
            return 0;
        }

        if (status < 0) {
            fprintf(stderr, "%s: %s.\n", filename, csv_error(status));
            free(field_contents);
            return -1;
        }
    }
}

static int usage(const char *argv0)
{
    fprintf(stderr, "\n");
    fprintf(stderr, "Usage: %s [ -h | --help | /? ]\n", argv0);
    fprintf(stderr, "       %s CSV-FILE [ ... ]\n", argv0);
    fprintf(stderr, "\n");
    fprintf(stderr, "Use special file name '-' to read from standard input.\n");
    fprintf(stderr, "\n");
    return EXIT_SUCCESS;
}

int main(int argc, char *argv[])
{
    FILE *in;
    int   arg;

    setlocale(LC_ALL, "");

    fwide(stdin, 1);
    fwide(stdout, 1);

    if (argc < 2)
        return usage(argv[0]);

    for (arg = 1; arg < argc; arg++) {

        if (!strcmp(argv[arg], "-h") || !strcmp(argv[arg], "--help") || !strcmp(argv[arg], "/?"))
            return usage(argv[0]);

        if (!strcmp(argv[arg], "-")) {

            if (show_csv(stdin, "(standard input)"))
                return EXIT_FAILURE;

        } else {

            in = fopen(argv[arg], "r");
            if (!in) {
                fprintf(stderr, "%s: %s.\n", argv[arg], strerror(errno));
                return EXIT_FAILURE;
            }

            if (show_csv(in, argv[arg]))
                return EXIT_FAILURE;
            if (ferror(in)) {
                fprintf(stderr, "%s: %s.\n", argv[arg], strerror(EIO));
                fclose(in);
                return EXIT_FAILURE;
            }
            if (fclose(in)) {
                fprintf(stderr, "%s: %s.\n", argv[arg], strerror(EIO));
                return EXIT_FAILURE;
            }
        }
    }

    return EXIT_SUCCESS;
}

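Using the above csv_next_field(), csv_skip_field(), and csv_next_record() functions is quite straightforward.

  1. Open the CSV file normally, then call fwide(stream, 1) on it to tell the C library you intend to use the wide string variants instead of the standard narrow string I/O functions.

  2. Create four variables, and initialize the first two:

     wchar_t   *field = NULL;
     size_t     allocated = 0;
     size_t     length;
     csv_status status;

    field is a pointer to the dynamically allocated contents of each field you read. It is allocated automatically; essentially, you do not need to worry about it at all. allocated holds the currently allocated size (in wide characters, including the terminating L'\0'), and we will use length and status later.

  3. At this point, you are ready to read or skip the first field in the first record.

    You do not want to call csv_next_record() at this point, unless you wish to skip the very first record in the file entirely.

  4. Call status = csv_skip_field(stream); to skip the next field, or status = csv_next_field(stream, &field, &allocated, &length); to read it.

    If status == CSV_OK, you have the field contents in the wide string field. It has length wide characters in it.

    If status == CSV_END, there were no more fields in the current record. (field is unchanged, and you should not examine it.)

    Otherwise, status < 0, and it describes an error code. You can use csv_error(status) to obtain a (narrow) string describing it.

  5. At any point, you can move (skip) to the start of the next record by calling status = csv_next_record(stream);.

    If it returns CSV_OK, there might be a new record available. (We only know when you try to read or skip the first field. This is similar to how the standard C library function feof() only tells you whether you have already tried to read past the end of input; it does not tell you whether there is more data available.)

    If it returns CSV_END, you have already processed the last record, and there are no more records.

    Otherwise, it returns a negative error code, status < 0. You can use csv_error(status) to obtain a (narrow) string describing it.

  6. After you are done, discard the field buffer:

     free(field);
     field = NULL;
     allocated = 0;

    You do not actually need to reset the variables to NULL and zero, but I recommend it. In fact, you can do the above at any point (when you are no longer interested in the contents of the current field), as csv_next_field() will then automatically allocate a new buffer as necessary.

    Note that free(NULL); is always safe and does nothing. You do not need to check whether field is NULL before freeing it. This is also the reason why I recommend initializing the variables immediately when you declare them. It just makes everything so much easier to handle.

  7. The compiled example program takes one or more CSV file names as command-line parameters, then reads the files and reports the contents of each field in the file. If you have an especially fiendishly complex CSV file, this is optimal for checking whether this approach reads all the fields correctly.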