Question

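I have a situation where I need to know the size of a String/encoding pair, in bytes, but cannot use the getBytes() method because 1) the String is very large, and duplicating it in a byte[] array would use a large amount of memory, but more to the point 2) getBytes() allocates a byte[] array based on the length of the String * the maximum possible bytes per character. So if I have a String with 1.5B characters and UTF-16 encoding, getBytes() will try to allocate a 3GB array and fail, since arrays are limited to 2^31 - X bytes (X is Java version specific).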

So, is there some way to calculate the byte size of a String/encoding pair directly from the String object?
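
To make the arithmetic above concrete, here is a minimal sketch. The class name ArrayLimitCheck and the helper worstCaseBytes are hypothetical, and the factor of 2 bytes per char is the UTF-16 figure used in the question, not a value queried from the JDK.

```java
final class ArrayLimitCheck {
    /** Worst-case buffer size for charCount chars (hypothetical helper). */
    static long worstCaseBytes(long charCount, int maxBytesPerChar) {
        return charCount * maxBytesPerChar;
    }

    public static void main(String[] args) {
        // 1.5B chars at 2 bytes each, as in the question.
        long needed = worstCaseBytes(1_500_000_000L, 2);
        System.out.println(needed);                      // 3000000000
        // A Java array index is an int, so no byte[] can be this long.
        System.out.println(needed > Integer.MAX_VALUE);  // true
    }
}
```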

UPDATE

Here is a working implementation of jtahlborn's answer:

private class CountingOutputStream extends OutputStream {
    long total; // long, because the byte count can exceed Integer.MAX_VALUE

    @Override
    public void write(int i) {
        throw new RuntimeException("don't use");
    }
    @Override
    public void write(byte[] b) {
        total += b.length;
    }

    @Override public void write(byte[] b, int offset, int len) {
        total += len;
    }
}

Answer 1

Simple, just write it to a dummy output stream:

class CountingOutputStream extends OutputStream {
  private long _total; // long, because the byte count can exceed Integer.MAX_VALUE

  @Override public void write(int b) {
    ++_total;
  }

  @Override public void write(byte[] b) {
    _total += b.length;
  }

  @Override public void write(byte[] b, int offset, int len) {
    _total += len;
  }

  public long getTotalSize() {
    return _total;
  }
}

CountingOutputStream cos = new CountingOutputStream();
Writer writer = new OutputStreamWriter(cos, "my_encoding");
//writer.write(myString);

// UPDATE: OutputStreamWriter does a simple copy of the _entire_ input string, to avoid that use:
for(int i = 0; i < myString.length(); i+=8096) {
  int end = Math.min(myString.length(), i+8096);
  writer.write(myString, i, end - i);
}

writer.flush();

System.out.println("Total bytes: " + cos.getTotalSize());

Not only is it simple, it is probably just as fast as the other "complex" answers.
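
For reference, the pieces above can be combined into one self-contained method. This is only a sketch under a few assumptions: the names EncodedSize, ByteCounter and encodedLength are made up for illustration, the chunk size of 8192 chars is arbitrary, and the counter is a long so that totals above 2GB do not overflow.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodedSize {
    /** Discards every byte written to it and only keeps a running count. */
    static final class ByteCounter extends OutputStream {
        long total;

        @Override public void write(int b) { total++; }
        @Override public void write(byte[] b, int off, int len) { total += len; }
    }

    static long encodedLength(String s, Charset cs) {
        ByteCounter counter = new ByteCounter();
        final int chunk = 8192; // arbitrary chunk size (assumption)
        try (Writer w = new OutputStreamWriter(counter, cs)) {
            // Write in slices so the writer never copies the whole string at once.
            for (int i = 0; i < s.length(); i += chunk) {
                w.write(s, i, Math.min(chunk, s.length() - i));
            }
        } catch (IOException e) {
            // ByteCounter never throws, so this is not expected to happen.
            throw new UncheckedIOException(e);
        }
        return counter.total; // the writer was flushed and closed above
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo \uD83D\uDE00";
        System.out.println(encodedLength(s, StandardCharsets.UTF_8));
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Both lines printed by main should show the same number, since the sample string is small enough for getBytes() to work as a cross-check.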

Answer 2

Here is an apparently working implementation:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TestUnicode {

    private final static int ENCODE_CHUNK = 100;

    public static long bytesRequiredToEncode(final String s,
            final Charset encoding) {
        long count = 0;
        for (int i = 0; i < s.length(); ) {
            int end = i + ENCODE_CHUNK;
            if (end >= s.length()) {
                end = s.length();
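            } else if (Character.isHighSurrogate(s.charAt(end - 1))) {
                // The chunk would end on a high surrogate: extend it by one
                // char so the surrogate pair is not split across chunks.
                end++;
            }
            count += encoding.encode(s.substring(i, end)).remaining();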
            i = end;
        }
        return count;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.appendCodePoint(11614);
            sb.appendCodePoint(1061122);
            sb.appendCodePoint(2065);
            sb.appendCodePoint(1064124);
        }
        Charset cs = StandardCharsets.UTF_8;

        System.out.println(bytesRequiredToEncode(new String(sb), cs));
        System.out.println(new String(sb).getBytes(cs).length);
    }
}

Output is:

1400
1400

In practice, I would increase ENCODE_CHUNK to 10 MChars or so.

Probably slightly less efficient than brettw's answer, but simpler to implement.

Answer 3

Using the Apache Commons IO library:

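import java.io.IOException;
import java.nio.charset.Charset;
import org.apache.commons.io.IOUtils;
import org.apache.commons.io.output.CountingOutputStream;
import org.apache.commons.io.output.NullOutputStream;

public static long stringLength(String string, Charset charset) {

    // The counting stream forwards to a null stream, so nothing is stored.
    try (NullOutputStream nul = new NullOutputStream();
         CountingOutputStream count = new CountingOutputStream(nul)) {

        IOUtils.write(string, count, charset.name());
        count.flush();
        // getByteCount() returns a long; getCount() is limited to the int range.
        return count.getByteCount();
    } catch (IOException e) {
        throw new IllegalStateException("Unexpected I/O.", e);
    }
}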

Answer 4

Guava has an implementation, according to this post:

Utf8.encodedLength()
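
For readers without Guava on the classpath, the idea of counting UTF-8 bytes directly from the chars can be sketched in plain Java. This is not Guava's code; the names Utf8Length and utf8Length are hypothetical, and the choice to throw on unpaired surrogates (instead of counting a replacement byte, as getBytes() would) is an assumption of this sketch.

```java
final class Utf8Length {
    /** Returns the number of bytes s would occupy in UTF-8, without encoding it. */
    static long utf8Length(CharSequence s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                       // ASCII
            } else if (c < 0x800) {
                bytes += 2;                       // two-byte sequence
            } else if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    bytes += 4;                   // surrogate pair = one 4-byte code point
                    i++;                          // skip the low surrogate
                } else {
                    throw new IllegalArgumentException("unpaired high surrogate at index " + i);
                }
            } else if (Character.isLowSurrogate(c)) {
                throw new IllegalArgumentException("unpaired low surrogate at index " + i);
            } else {
                bytes += 3;                       // rest of the BMP
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo \uD83D\uDE00";
        System.out.println(utf8Length(s));
        System.out.println(s.getBytes(java.nio.charset.StandardCharsets.UTF_8).length);
    }
}
```

This only covers UTF-8; for other charsets one of the encoder-based approaches in the other answers is needed.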

Answer 5

OK, this is extremely gross. I admit that, but this stuff is hidden by the JVM, so we have to dig a little. And sweat a little.

First, we want the actual char[] that backs the String, without making a copy. To do this we have to use reflection to get at the 'value' field:

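// Assumes String is backed by a char[] field named 'value' (true up to Java 8;
// Java 9+ stores a byte[] there, so the cast below would fail).
char[] chars = null;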
for (Field field : String.class.getDeclaredFields()) {
    if ("value".equals(field.getName())) {
        field.setAccessible(true);
        chars = (char[]) field.get(string); // <--- got it!
        break;
    }
}

Next, you need to implement a subclass of java.nio.ByteBuffer. Something like:

class MyByteBuffer extends ByteBuffer {
    int length;            
    // Your implementation here
};

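Ignore all of the getters and implement all of the put methods, such as put(byte) and putChar(char). Inside something like put(byte), increment length by 1; inside put(byte[]), increment length by the array length. Get it? For everything that is put, you add its size to length. But you are not storing anything in your ByteBuffer, you are just counting and throwing away, so no space is taken. If you set breakpoints on the put methods, you can figure out which ones you actually need to implement. putFloat(float), for example, is probably not used.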

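Now for the grand finale, put it all together: wrap the char[] in a CharBuffer, run it through a CharsetEncoder for your charset into the counting buffer, and read off length.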
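
The same count-and-discard idea can also be sketched without reflection or a custom buffer class. The variant below is an assumption-laden alternative, not the answer's own code: it swaps the ByteBuffer subclass for a small scratch buffer that is cleared after every pass (ByteBuffer's constructors are package-private, so subclassing it from application code does not compile), and it uses CharBuffer.wrap(CharSequence), which gives a read-only view of the String instead of a copy. The names EncoderCount and encodedLength and the 8192-byte buffer size are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncoderCount {
    static long encodedLength(CharSequence s, Charset cs) {
        // REPLACE mirrors what String.getBytes(Charset) does with bad input.
        CharsetEncoder enc = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer in = CharBuffer.wrap(s);          // view over the chars, no copy
        ByteBuffer out = ByteBuffer.allocate(8192);  // scratch space, reused each pass
        long total = 0;
        CoderResult r;
        do {
            r = enc.encode(in, out, true);  // fills 'out' until input or space runs out
            total += out.position();        // count what was produced...
            out.clear();                    // ...and throw it away
        } while (r.isOverflow());
        do {
            r = enc.flush(out);             // some encoders emit trailing bytes
            total += out.position();
            out.clear();
        } while (r.isOverflow());
        return total;
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo \uD83D\uDE00";
        System.out.println(encodedLength(s, StandardCharsets.UTF_8));
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Memory use stays at the size of the scratch buffer regardless of the String's length, which is the property the question asks for.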

Get size of String w/ encoding in bytes without converting to byte[]
