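How do I measure the correct size of non-ASCII characters?

Time: 2017-10-26 06:33:33

Tags: c++ string c++11 size non-ascii-characters

In the following program, I am trying to measure the length of a string that contains non-ASCII characters.

However, I am not sure why size() does not print the correct length when the string contains non-ASCII characters.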

#include <iostream>
#include <string>

int main()
{
    std::string s1 = "Hello";
    std::string s2 = "इंडिया"; // non-ASCII string
    std::cout << "Size of " << s1 << " is " << s1.size() << std::endl;
    std::cout << "Size of " << s2 << " is " << s2.size() << std::endl;
}

Output:

Size of Hello is 5
Size of इंडिया is 18

Live demo on Wandbox

2 answers:

Answer 0: (score: 4)

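std::string::size returns the length in bytes (char code units), not the number of characters. Your second string is encoded in Unicode (UTF-8 here), so each character may take several bytes: each of its six Devanagari code points occupies 3 bytes, which gives 18. Note that the same applies to std::wstring::size, because the result depends on the encoding: it returns the number of wide characters (wchar_t code units), not the number of actual characters. With UTF-16 the two match only for characters in the Basic Multilingual Plane, since every other character takes two code units (a surrogate pair); more in the linked answer.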
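To make the point about code units concrete, here is a minimal sketch that stores one code point, U+1F600, in three encodings and reports what size() returns for each. The function names are illustrative only and do not come from any library; the UTF-8 bytes are written as hex escapes so the result does not depend on the source file encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Each function stores the single code point U+1F600 in one encoding
// and returns size(), which counts code units rather than characters.

// UTF-8: the code point is encoded as the four bytes F0 9F 98 80.
std::size_t units_utf8() { return std::string("\xF0\x9F\x98\x80").size(); }

// UTF-16: a code point above U+FFFF needs a surrogate pair (two char16_t).
std::size_t units_utf16() { return std::u16string(u"\U0001F600").size(); }

// UTF-32: one char32_t per code point.
std::size_t units_utf32() { return std::u32string(U"\U0001F600").size(); }
```

The three functions return 4, 2 and 1 for the same single code point, which is why size() alone cannot be read as a character count in UTF-8 or UTF-16.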

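To measure the actual length (the number of symbols), you need to know the encoding so that you can separate (and therefore count) the characters correctly. The linked answer may be helpful for UTF-8 (although the method it uses is deprecated in C++17).

Another option for UTF-8 is to count the number of first bytes, that is, every byte that is not a continuation byte of the form 10xxxxxx (taken from the linked answer):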

int utf8_length(const std::string& s) {
  int len = 0;
  for (auto c : s)
      len += (c & 0xc0) != 0x80;
  return len;
}
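As a usage sketch of the first-byte counting approach above, the variant below returns std::size_t and works on unsigned bytes. The names is_continuation_byte and count_code_points are hypothetical, and the code assumes the input is valid UTF-8. Note that it counts Unicode code points, which is not always the number of characters a reader perceives, since combining marks such as Devanagari vowel signs are separate code points.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
bool is_continuation_byte(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Counts code points in a string assumed to hold valid UTF-8:
// every byte that is not a continuation byte starts a new code point.
std::size_t count_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if (!is_continuation_byte(byte)) {
            ++count;
        }
    }
    return count;
}
```

With a UTF-8 encoded source file, count_code_points("Hello") returns 5 and count_code_points("इंडिया") returns 6, compared with the 18 reported by size().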

Answer 1: (score: 1)

I used the std::wstring_convert class and got the correct string length.

#include <string>
#include <iostream>
#include <codecvt>
#include <locale> // std::wstring_convert is declared in <locale>

int main()
{
    std::string s1 = "Hello";
    std::string s2 = "इंडिया"; // non-ASCII string
    // Convert the UTF-8 bytes to UTF-32, where each element is one code point.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cn;
    auto sz = cn.from_bytes(s2).size();
    std::cout << "Size of " << s2 << " is " << sz << std::endl;
}

Live demo on Wandbox

See the linked reference for more information about std::wstring_convert.