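Truncated read with UTF-16-encoded text in C++

Time: 2016-09-12 00:02:17

Tags: c++ c++11 encoding utf-8 utf-16

My goal is to convert external input sources to a common UTF-8 internal encoding, since it is compact and compatible with many of the libraries I use (such as RE2). Because I do not need string slicing except on pure ASCII, UTF-8 is an ideal format for me. One of the external input formats I need to be able to decode is UTF-16.

To test reading UTF-16 (either big-endian or little-endian) in C++, I converted a UTF-8 test file to UTF-16 LE and UTF-16 BE. The file is simple gibberish in CSV format, mixing many different source languages (English, French, Japanese, Korean, Arabic, emoji, Thai) to produce a reasonably complex file: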

"This","佐藤 幹夫","Mêmes","친구"
"ภควา"," كيبورد للكتابة بالعربي","ウゥキュ,","🛂"

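UTF-8 example

Parsing this file encoded as UTF-8 with the following code produces the expected output (I understand this example is mostly artificial: my system encoding is UTF-8, so no actual conversion to wide characters and back to bytes is required):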

#include <sstream>
#include <locale>
#include <iostream>
#include <fstream>
#include <codecvt>

std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename, std::ios::binary);
    wif.imbue(std::locale(wif.getloc(), new std::codecvt_utf8<wchar_t, 0x10ffff>));
    std::wstringstream wss;
    wss << wif.rdbuf();
    return wss.str();
}


int main()
{
    std::wstring read = readFile("utf-8.csv");
    std::cout << read.size() << std::endl;

    using convert_type = std::codecvt_utf8<wchar_t>;
    std::wstring_convert<convert_type, wchar_t> converter;
    std::string converted_str = converter.to_bytes( read );
    std::cout << converted_str;

    return 0;
}

When I compile and run the file (on Linux, where the system encoding is UTF-8), I get the following output:

$ g++ utf8.cpp -o utf8 -std=c++14
$ ./utf8
73
"This","佐藤 幹夫","Mêmes","친구"
"ภควา"," كيبورد للكتابة بالعربي","ウゥキュ,","🛂"

UTF-16 example

However, when I try a similar example with UTF-16, I get a truncated file, even though the file loads correctly in text editors, Python, and so on.

#include <fstream>
#include <sstream>
#include <iostream>
#include <locale>
#include <codecvt>
#include <string>


std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename, std::ios::binary);
    wif.imbue(std::locale(wif.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff>));
    std::wstringstream wss;
    wss << wif.rdbuf();
    return wss.str();
}


int main()
{
    std::wstring read = readFile("utf-16.csv");
    std::cout << read.size() << std::endl;

    using convert_type = std::codecvt_utf8<wchar_t>;
    std::wstring_convert<convert_type, wchar_t> converter;
    std::string converted_str = converter.to_bytes( read );
    std::cout << converted_str;

    return 0;
}

When I compile and run the file (on Linux, so the system encoding is UTF-8), I get the following output for the little-endian file:

$ g++ utf16.cpp -o utf16 -std=c++14
$ ./utf16
19
"This","PO

For the big-endian file, I get the following:

$ g++ utf16.cpp -o utf16 -std=c++14
$ ./utf16
19
"This","OP

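Interestingly, the CJK characters should be part of the Basic Multilingual Plane, yet they are clearly not converted properly, and the file is truncated early. The same problem occurs with a line-by-line approach.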
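One thing worth ruling out here is byte order. std::codecvt_utf16 treats its input as big-endian unless std::little_endian is passed as the third template argument, and the readFile above never sets it. The sketch below takes the file stream out of the picture and decodes an in-memory byte string with std::wstring_convert. The names decode_utf16le and decode_utf16be are hypothetical helpers, and char32_t is used instead of wchar_t only to make the target width explicit. This is not a confirmed fix for the truncation shown above, and these facilities are deprecated since C++17.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decode raw UTF-16LE bytes into UTF-32 code points.
// std::little_endian overrides the facet's big-endian default.
std::u32string decode_utf16le(const std::string& bytes)
{
    std::wstring_convert<
        std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>,
        char32_t> conv;
    return conv.from_bytes(bytes);
}

// Hypothetical helper: decode raw UTF-16BE bytes (the facet's default order).
std::u32string decode_utf16be(const std::string& bytes)
{
    std::wstring_convert<std::codecvt_utf16<char32_t>, char32_t> conv;
    return conv.from_bytes(bytes);
}
```

Feeding the raw bytes of each test file (read with a plain std::ifstream in binary mode) to the matching helper would show whether the facet itself or the std::wifstream plumbing is at fault.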

Other resources

I previously checked the following resources, most notably this answer and this answer. None of their solutions proved fruitful for me.

Other details

LANG = en_US.UTF-8
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.2)

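I am happy to provide any other details. Thank you.

EDITS

Adrian mentioned in the comments that I should provide a hexdump. Here it is for "utf-16le", the little-endian UTF-16 encoded file: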

0000000 0022 0054 0068 0069 0073 0022 002c 0022
0000010 4f50 85e4 0020 5e79 592b 0022 002c 0022
0000020 004d 00ea 006d 0065 0073 0022 002c 0022
0000030 ce5c ad6c 0022 000a 0022 0e20 0e04 0e27
0000040 0e32 0022 002c 0022 0020 0643 064a 0628
0000050 0648 0631 062f 0020 0644 0644 0643 062a
0000060 0627 0628 0629 0020 0628 0627 0644 0639
0000070 0631 0628 064a 0022 002c 0022 30a6 30a5
0000080 30ad 30e5 002c 0022 002c 0022 d83d dec2
0000090 0022 000a                              
0000094
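Assuming the dump was taken on a little-endian host, each four-digit group above is one UTF-16 code unit: 0x94 bytes is 74 code units, and the final d83d dec2 is a surrogate pair encoding U+1F6C2, which accounts for the 73 code points reported by the UTF-8 run. As a cross-check that does not depend on <codecvt>, here is a minimal hand-written decoder. The names utf16_bytes_to_utf8 and append_utf8 are hypothetical, and BOM handling is deliberately left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Append one Unicode code point to `out` as UTF-8.
static void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: decode raw UTF-16 bytes (no BOM handling) to UTF-8.
std::string utf16_bytes_to_utf8(const std::string& bytes, bool little_endian)
{
    if (bytes.size() % 2 != 0)
        throw std::runtime_error("odd number of bytes in UTF-16 input");

    // Assemble the 16-bit code unit at byte offset i in the requested order.
    auto unit = [&](std::size_t i) -> std::uint32_t {
        std::uint32_t a = static_cast<unsigned char>(bytes[i]);
        std::uint32_t b = static_cast<unsigned char>(bytes[i + 1]);
        return little_endian ? (b << 8) | a : (a << 8) | b;
    };

    std::string out;
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        std::uint32_t cp = unit(i);
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            if (i + 3 >= bytes.size())
                throw std::runtime_error("truncated surrogate pair");
            std::uint32_t lo = unit(i + 2);
            if (lo < 0xDC00 || lo > 0xDFFF)
                throw std::runtime_error("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            i += 2;  // the pair consumed two code units
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::runtime_error("unpaired low surrogate");
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Reading the file into a std::string with std::ifstream in binary mode and passing it with little_endian set to true for the LE file (false for the BE file) should reproduce the original UTF-8 text if the files are valid.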

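qexyn suggested removing the std::ios::binary flag. I tried that, but it changed nothing.

Finally, I tried iconv to check whether these are valid files, using both the command-line utility and the C module.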

$ iconv -f="UTF-16BE" -t="UTF-8" utf-16be.csv
"This","佐藤 幹夫","Mêmes","친구"
"ภควา"," كيبورد للكتابة بالعربي","ウゥキュ,","🛂"

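Clearly, iconv has no problem with the source files. This is leading me toward iconv, since it is cross-platform, easy to use, and well tested, but if anyone has an answer using the standard library, I will gladly accept it.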
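For reference, the same conversion through the iconv C API might look like the sketch below. iconv_convert is a hypothetical helper. It assumes the glibc prototype, where the input pointer is char** (some platforms declare const char**), and a stateless target encoding such as UTF-8, so no final flush call is made.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a whole in-memory buffer between two
// encodings named the way iconv expects (e.g. "UTF-16LE", "UTF-8").
std::string iconv_convert(const std::string& input,
                          const char* from, const char* to)
{
    iconv_t cd = iconv_open(to, from);  // note the order: target first
    if (cd == (iconv_t) -1)
        throw std::runtime_error("iconv_open: unsupported encoding pair");

    std::string out;
    char buffer[4096];
    char* src = const_cast<char*>(input.data());  // iconv does not modify it
    std::size_t src_left = input.size();

    while (src_left > 0) {
        char* dst = buffer;
        std::size_t dst_left = sizeof buffer;
        std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
        int err = errno;  // save before any other call can overwrite it
        out.append(buffer, sizeof buffer - dst_left);
        if (rc == (std::size_t) -1 && err != E2BIG) {
            // EILSEQ (invalid sequence) or EINVAL (truncated sequence).
            iconv_close(cd);
            throw std::runtime_error("iconv: invalid or truncated input");
        }
        // E2BIG only means the scratch buffer filled up; loop and continue.
    }
    iconv_close(cd);
    return out;
}
```

Unlike a converter that skips bytes it cannot decode, this sketch stops with an exception on the first bad sequence, which makes a corrupt input file visible instead of silently shortening it.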

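1 answer:

Answer 0 (score: 0)

I am still waiting for a potential answer using the C++ standard library, but I have not had any success with it, so I wrote an implementation that works with Boost and iconv (both fairly common dependencies). It consists of a header and a source file, handles all of the cases above, performs reasonably well, accepts any pair of encodings iconv supports, and wraps a stream object so it integrates easily into existing code. Since I am rather new to C++, I would test the code if you choose to use it yourself: I am far from an expert.

encoding.hpp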

#pragma once

#include <iostream>

#if defined(_MSC_VER) && (_MSC_VER >= 1020)
# pragma once
#endif

#include <cassert>
#include <iosfwd>            // streamsize.
#include <memory>            // allocator, bad_alloc.
#include <new>
#include <string>
#include <boost/config.hpp>
#include <boost/cstdint.hpp>
#include <boost/detail/workaround.hpp>
#include <boost/iostreams/constants.hpp>
#include <boost/iostreams/detail/config/auto_link.hpp>
#include <boost/iostreams/detail/config/dyn_link.hpp>
#include <boost/iostreams/detail/config/wide_streams.hpp>
#include <boost/iostreams/detail/config/zlib.hpp>
#include <boost/iostreams/detail/ios.hpp>
#include <boost/iostreams/filter/symmetric.hpp>
#include <boost/iostreams/pipeline.hpp>
#include <boost/type_traits/is_same.hpp>
#include <boost/iostreams/filter/zlib.hpp>
#include <iconv.h>

// Must come last.
#ifdef BOOST_MSVC
#   pragma warning(push)
#   pragma warning(disable:4251 4231 4660)     // Dependencies not exported.
#endif
#include <boost/config/abi_prefix.hpp>
#undef small


namespace boost
{
namespace iostreams
{
// CONSTANTS
// ---------

extern const size_t maxUnicodeWidth;

// OBJECTS
// -------


/** @brief Parameters for input and output encodings to pass to iconv.
 */
struct encoded_params {
    std::string input;
    std::string output;

    encoded_params(const std::string &input = "UTF-8",
                   const std::string &output = "UTF-8"):
        input(input),
        output(output)
    {}
};


namespace detail
{
// DETAILS
// -------


/** @brief Base class for the character set conversion filter.
 *  Contains a core process function which converts the source
 *  encoding to the destination encoding.
 */
class BOOST_IOSTREAMS_DECL encoded_base {
public:
    typedef char char_type;
protected:
    encoded_base(const encoded_params & params = encoded_params());

    ~encoded_base();

    int convert(const char * & src_begin,
                const char * & src_end,
                char * & dest_begin,
                char * & dest_end);

    int copy(const char * & src_begin,
                const char * & src_end,
                char * & dest_begin,
                char * & dest_end);

    int process(const char * & src_begin,
                const char * & src_end,
                char * & dest_begin,
                char * & dest_end,
                int /* flushLevel */);

public:
    int total_in();
    int total_out();


private:
    iconv_t conv;
    bool differentCharset;
};


/** @brief Template implementation for the encoded writer.
 *
 *  Model of a C-style file filter for character set conversions, via
 *  iconv.
 */
template<typename Alloc = std::allocator<char> >
class encoded_writer_impl : public encoded_base {
public:
    encoded_writer_impl(const encoded_params &params = encoded_params());
    ~encoded_writer_impl();
    bool filter(const char*& src_begin, const char* src_end,
                char*& dest_begin, char* dest_end, bool flush);
    void close();
};


/** @brief Template implementation for the encoded reader.
 *
 *  Model of a C-style file filter for character set conversions, via
 *  iconv.
 */
template<typename Alloc = std::allocator<char> >
class encoded_reader_impl : public encoded_base {
public:
    encoded_reader_impl(const encoded_params &params = encoded_params());
    ~encoded_reader_impl();
    bool filter(const char*& begin_in, const char* end_in,
                char*& begin_out, char* end_out, bool flush);
    void close();
    bool eof() const
    {
        return eof_;
    }

private:
    bool eof_;
};



}   /* detail */

// FILTERS
// -------

/** @brief Model of InputFilter and OutputFilter implementing
 *  character set conversion via iconv.
 */
template<typename Alloc = std::allocator<char> >
struct basic_encoded_writer
    : symmetric_filter<detail::encoded_writer_impl<Alloc>, Alloc>
{
private:
    typedef detail::encoded_writer_impl<Alloc>         impl_type;
    typedef symmetric_filter<impl_type, Alloc>  base_type;
public:
    typedef typename base_type::char_type               char_type;
    typedef typename base_type::category                category;
    basic_encoded_writer(const encoded_params &params = encoded_params(),
                         int buffer_size = default_device_buffer_size);
    int total_in() { return this->filter().total_in(); }
};
BOOST_IOSTREAMS_PIPABLE(basic_encoded_writer, 1)

typedef basic_encoded_writer<> encoded_writer;


/** @brief Model of InputFilter and OutputFilter implementing
 *  character set conversion via iconv.
 */
template<typename Alloc = std::allocator<char> >
struct basic_encoded_reader
    : symmetric_filter<detail::encoded_reader_impl<Alloc>, Alloc>
{
private:
    typedef detail::encoded_reader_impl<Alloc>       impl_type;
    typedef symmetric_filter<impl_type, Alloc>  base_type;
public:
    typedef typename base_type::char_type               char_type;
    typedef typename base_type::category                category;
    basic_encoded_reader(const encoded_params &params = encoded_params(),
                         int buffer_size = default_device_buffer_size);
    int total_out() { return this->filter().total_out(); }
    bool eof() { return this->filter().eof(); }
};
BOOST_IOSTREAMS_PIPABLE(basic_encoded_reader, 1)

typedef basic_encoded_reader<> encoded_reader;


namespace detail
{
// IMPLEMENTATION
// --------------


/** @brief Initialize the encoded writer with the iconv parameters.
 */
template<typename Alloc>
encoded_writer_impl<Alloc>::encoded_writer_impl(const encoded_params& p):
    encoded_base(p)
{}


/** @brief Close the encoded writer.
 */
template<typename Alloc>
encoded_writer_impl<Alloc>::~encoded_writer_impl()
{}


/** @brief Implementation of the symmetric, character set encoding filter
 *  for the writer.
 */
template<typename Alloc>
bool encoded_writer_impl<Alloc>::filter
    (const char*& src_begin, const char* src_end,
     char*& dest_begin, char* dest_end, bool flush)
{
    int result = process(src_begin, src_end, dest_begin, dest_end, flush);
    return result == -1;
}


/** @brief Close the encoded writer.
 */
template<typename Alloc>
void encoded_writer_impl<Alloc>::close()
{}


/** @brief Close the encoded reader.
 */
template<typename Alloc>
encoded_reader_impl<Alloc>::~encoded_reader_impl()
{}


/** @brief Initialize the encoded reader with the iconv parameters.
 */
template<typename Alloc>
encoded_reader_impl<Alloc>::encoded_reader_impl(const encoded_params& p):
    encoded_base(p),
    eof_(false)
{}


/** @brief Implementation of the symmetric, character set encoding filter
 *  for the reader.
 */
template<typename Alloc>
bool encoded_reader_impl<Alloc>::filter
    (const char*& src_begin, const char* src_end,
    char*& dest_begin, char* dest_end, bool /* flush */)
{
    int result = process(src_begin, src_end, dest_begin, dest_end, true);
    return result;
}


/** @brief Close the encoded reader.
 */
template<typename Alloc>
void encoded_reader_impl<Alloc>::close()
{
    // cannot re-open, not a true stream
    //eof_ = false;
    //reset(false, true);
}

}   /* detail */


/** @brief Initializer for the symmetric write filter, which initializes
 *  the iconv base from the parameters and the buffer size.
 */
template<typename Alloc>
basic_encoded_writer<Alloc>::basic_encoded_writer
(const encoded_params& p, int buffer_size):
    base_type(buffer_size, p)
{}


/** @brief Initializer for the symmetric read filter, which initializes
 *  the iconv base from the parameters and the buffer size.
 */
template<typename Alloc>
basic_encoded_reader<Alloc>::basic_encoded_reader(const encoded_params &p, int buffer_size):
    base_type(buffer_size, p)
{}


}   /* iostreams */
}   /* boost */

#include <boost/config/abi_suffix.hpp> // Pops abi_suffix.hpp pragmas.
#ifdef BOOST_MSVC
    # pragma warning(pop)
#endif

encoding.cpp

#include "encoding.hpp"

#include <iconv.h>

#include <algorithm>
#include <cstring>
#include <string>


namespace boost
{
namespace iostreams
{
namespace detail
{
// CONSTANTS
// ---------

const size_t maxUnicodeWidth = 4;

// DETAILS
// -------


/** @brief Initialize the iconv converter with the source and
 *  destination encoding.
 */
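encoded_base::encoded_base(const encoded_params &params):
    conv((iconv_t) -1),
    differentCharset(params.output != params.input)
{
    if (differentCharset) {
        // iconv_open takes the target encoding first, then the source.
        conv = iconv_open(params.output.c_str(), params.input.c_str());
        if (conv == (iconv_t) -1) {
            // Unsupported encoding pair: fail loudly instead of converting garbage.
            throw std::ios_base::failure("iconv_open failed");
        }
    }
}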


/** @brief Cleanup the iconv converter.
 */
encoded_base::~encoded_base()
{
    if (differentCharset) {
        iconv_close(conv);
    }
}


/** @brief C-style stream converter, which converts the source
 *  character array to the destination character array, calling iconv
 *  repeatedly and skipping one byte whenever iconv makes no progress.
 */
int encoded_base::convert(const char * & src_begin,
                          const char * & src_end,
                          char * & dest_begin,
                          char * & dest_end)
{
    char *end = dest_end - maxUnicodeWidth;
    size_t srclen, dstlen;
    while (src_begin < src_end && dest_begin < end) {
        srclen = src_end - src_begin;
        dstlen = dest_end - dest_begin;
        char *pIn = const_cast<char *>(src_begin);
        iconv(conv, &pIn, &srclen, &dest_begin, &dstlen);
        if (src_begin == pIn) {
            src_begin++;
        } else {
            src_begin = pIn;
        }
    }

    return 0;
}


/** C-style stream converter, which copies source bytes to output
 *  bytes.
 */
int encoded_base::copy(const char * & src_begin,
                          const char * & src_end,
                          char * & dest_begin,
                          char * & dest_end)
{
    size_t srclen = src_end - src_begin;
    size_t dstlen = dest_end - dest_begin;
    size_t length = std::min(srclen, dstlen);

    memmove((void*) dest_begin, (void *) src_begin, length);
    src_begin += length;
    dest_begin += length;

    return 0;
}


/** @brief Processes the input stream through the stream filter.
 */
int encoded_base::process(const char * & src_begin,
                          const char * & src_end,
                          char * & dest_begin,
                          char * & dest_end,
                          int /* flushLevel */)
{
    if (differentCharset) {
        return convert(src_begin, src_end, dest_begin, dest_end);
    } else {
        return copy(src_begin, src_end, dest_begin, dest_end);
    }
}


}   /* detail */
}   /* iostreams */
}   /* boost */

Example program

#include "encoding.hpp"

#include <boost/iostreams/filtering_streambuf.hpp>
#include <fstream>
#include <string>


int main()
{
    std::ifstream fin("utf8.csv", std::ios::binary);
    std::ofstream fout("utf16le.csv", std::ios::binary);

    // encoding
    boost::iostreams::filtering_streambuf<boost::iostreams::input> streambuf;
    streambuf.push(boost::iostreams::encoded_reader({"UTF-8", "UTF-16LE"}));
    streambuf.push(fin);
    std::istream stream(&streambuf);

    std::string line;
    while (std::getline(stream, line)) {
        fout << line << std::endl;
    }
    fout.close();
}

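In the example above, we write a copy of a UTF-8-encoded file as UTF-16LE: the stream buffer converts the UTF-8 text to UTF-16LE, which we then write to the output as bytes. This adds only 4 (readable) lines of code to the whole process.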