Question

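Queue names can contain up to 255 bytes of UTF-8 characters.

In Ruby (1.9.3), how do I truncate a UTF-8 string by byte count without cutting a character in half? The result should be the longest valid UTF-8 string that fits within the byte limit.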
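To make the problem concrete, here is a minimal sketch of what goes wrong with a plain byte cut. The variable names and the sample string are only illustrative; `String#byteslice`, `bytesize`, and `valid_encoding?` are core Ruby methods available in 1.9.3.

```ruby
# The last character, 'é' (U+00E9), takes 2 bytes in UTF-8.
name = "caf\u00e9"
total_bytes = name.bytesize          # 3 ASCII bytes + 2 bytes for 'é'

# Cutting at 4 bytes lands in the middle of 'é'.
cut = name.byteslice(0, 4)
broken = !cut.valid_encoding?        # the trailing byte is an incomplete sequence

# Cutting at 3 bytes happens to fall on a character boundary.
clean = name.byteslice(0, 3)
```

The goal is a helper that always returns something like `clean`, never something like `cut`.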

Answer 1

For Rails >= 3.0 you have the ActiveSupport::Multibyte::Chars#limit method.

From the API docs:

- (Object) limit(limit)

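Limit the byte size of the string to a number of bytes without breaking characters. Usable when the storage for a string is limited for some reason.

Example: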

'こんにちは'.mb_chars.limit(7).to_s # => "こん"

Answer 2

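bytesize gives you the length of the string in bytes, while operations such as slice work on whole characters and so won't mangle the string (as long as the string's encoding is set correctly).

A simple approach is to iterate through the string, appending characters until the next one would exceed the limit: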

s.each_char.each_with_object('') do|char, result| 
  if result.bytesize + char.bytesize > 255
    break result
  else
    result << char
  end
end

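If you want to be crafty, you can copy the first 63 characters directly, since any Unicode character is at most 4 bytes in UTF-8 (63 * 4 = 252 <= 255).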
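A minimal sketch of that shortcut combined with the loop above. The method name `limit_bytes_fast` and its parameters are made up for illustration; it assumes `str` is a validly encoded UTF-8 string.

```ruby
# Hypothetical helper: copy the characters that are guaranteed to fit,
# then fall back to the character-by-character check for the rest.
def limit_bytes_fast(str, max_bytes)
  # A UTF-8 character is at most 4 bytes, so the first max_bytes / 4
  # characters always fit and need no size check.
  safe_chars = max_bytes / 4
  result = str[0, safe_chars]

  # str[safe_chars..-1] is nil when the string is shorter than safe_chars.
  (str[safe_chars..-1] || '').each_char do |char|
    break if result.bytesize + char.bytesize > max_bytes
    result << char
  end
  result
end
```

With a 255-byte limit this skips the check for the first 63 characters and only inspects the tail.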

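Note that this still isn't perfect. For example, imagine that the last 3 bytes of your string are the character 'e' (1 byte) followed by a combining acute accent (U+0301, 2 bytes). Slicing off the last 2 bytes produces a string that is still valid UTF-8, but what the user sees changes from "é" to "e", which might change the meaning of the text. This probably isn't a big deal when you're just naming RabbitMQ queues, but it could matter in other circumstances. For example, in French a newsletter headline of "Un policier tué" means "A policeman was killed", whereas "Un policier tue" means "A policeman kills".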
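One way to avoid splitting a base letter from its combining accent is to iterate over grapheme clusters instead of code points. This is only a sketch: `limit_bytes_by_grapheme` is a made-up name, and it relies on `String#each_grapheme_cluster`, which exists from Ruby 2.5 onward (so it does not apply to 1.9.3).

```ruby
# Hypothetical helper: truncate on grapheme-cluster boundaries so that
# "e" + U+0301 is kept or dropped as one unit. Requires Ruby >= 2.5.
def limit_bytes_by_grapheme(str, max_bytes)
  result = String.new(encoding: str.encoding)
  str.each_grapheme_cluster do |cluster|
    break if result.bytesize + cluster.bytesize > max_bytes
    result << cluster
  end
  result
end
```

With the decomposed string "abce\u0301d", a limit of 4 or 5 bytes yields "abc" rather than "abce", because the 3-byte cluster does not fit.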

Answer 3

I think I found something that works.

def limit_bytesize(str, size)
  str.encoding.name == 'UTF-8' or raise ArgumentError, "str must have UTF-8 encoding"

  # Change to canonical unicode form (compose any decomposed characters).
  # Works only if you're using active_support
  str = str.mb_chars.compose.to_s if str.respond_to?(:mb_chars)

  # Start with a string of the correct byte size, but
  # with a possibly incomplete char at the end.
  new_str = str.byteslice(0, size)

  # We need to force_encoding from utf-8 to utf-8 so ruby will re-validate
  # (idea from halfelf).
  # (The empty? guard avoids calling force_encoding on nil when nothing fits.)
  until new_str.empty? || new_str[-1].force_encoding('utf-8').valid_encoding?
    # remove the invalid char
    new_str = new_str.slice(0..-2)
  end
  new_str
end

Usage:

>> limit_bytesize("abc\u2014d", 4)
=> "abc"
>> limit_bytesize("abc\u2014d", 5)
=> "abc"
>> limit_bytesize("abc\u2014d", 6)
=> "abc—"
>> limit_bytesize("abc\u2014d", 7)
=> "abc—d"

UPDATE...

Decomposition behavior without active_support:

>> limit_bytesize("abc\u0065\u0301d", 4)
=> "abce"
>> limit_bytesize("abc\u0065\u0301d", 5)
=> "abce"
>> limit_bytesize("abc\u0065\u0301d", 6)
=> "abcé"
>> limit_bytesize("abc\u0065\u0301d", 7)
=> "abcéd"

Decomposition behavior with active_support:

>> limit_bytesize("abc\u0065\u0301d", 4)
=> "abc"
>> limit_bytesize("abc\u0065\u0301d", 5)
=> "abcé"
>> limit_bytesize("abc\u0065\u0301d", 6)
=> "abcéd"
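On newer Rubies the same idea can be sketched without active_support. This is an assumption-laden variant, not the code above: `limit_bytesize_scrub` is a made-up name, `String#scrub` needs Ruby >= 2.1, and `String#unicode_normalize` needs Ruby >= 2.2 and a Unicode encoding such as UTF-8.

```ruby
# Hypothetical variant: compose with the standard library, cut by bytes,
# then drop whatever incomplete byte sequence is left at the end.
def limit_bytesize_scrub(str, size)
  str.unicode_normalize(:nfc)   # compose "e" + U+0301 into U+00E9
     .byteslice(0, size)        # may leave a partial character at the end
     .scrub('')                 # remove invalid byte sequences
end
```

One caveat: `scrub('')` removes invalid bytes anywhere in the string, not just at the cut, so input that was already malformed comes back shorter than expected.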

Answer 4

How about this:

s = "δogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδog"
count = 0
while true
  more_truncate = "a" + (255-count).to_s
  s2 = s.unpack(more_truncate)[0]
  s2.force_encoding 'utf-8'

  if s2[-1].valid_encoding?
    break
  else
    count += 1
  end
end

s2.force_encoding 'utf-8'
puts s2

Answer 5

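Rails 6 will provide a String#truncate_bytes that behaves like truncate but takes a byte count instead of a character count. And, of course, it returns a valid string (it does not blindly cut in the middle of a multibyte char).

Taken from the docs: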

>> "🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪".size
=> 20
>> "🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪".bytesize
=> 80
>> "🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪🔪".truncate_bytes(20)
=> "🔪🔪🔪🔪…"

Answer 6

Without Rails

Fredrick Cheung's answer is an excellent O(n) starting point that inspired this solution, which needs only O(log n) probes:

def limit_bytesize(str, max_bytesize)
  return str unless str.bytesize > max_bytesize

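  # find the smallest index whose prefix exceeds the bytesize; that index
  # equals the number of characters that fit (str[0..-1] would wrongly
  # return the whole string when the index is 0)
  just_over = (0...str.size).bsearch { |l| str[0..l].bytesize > max_bytesize }
  str[0, just_over]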
end

I believe this also gets the automatic max_bytesize / 4 speedup mentioned in that answer, since bsearch starts from the middle.
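If you would rather make that bound explicit than rely on where bsearch happens to start, the search window can be narrowed up front. This is a sketch under stated assumptions: `limit_bytesize_bounded` is a made-up name, the input is valid UTF-8, `max_bytesize` is non-negative, and `Range#bsearch` requires Ruby >= 2.0.

```ruby
# Hypothetical variant: bisect only over character counts that can
# possibly be the answer.
def limit_bytesize_bounded(str, max_bytesize)
  return str if str.bytesize <= max_bytesize

  # Each character is 1..4 bytes, so at least max_bytesize / 4 characters
  # fit, and at most max_bytesize characters fit.
  low  = max_bytesize / 4
  high = [str.size, max_bytesize + 1].min

  # Smallest character count whose prefix is too long. It always exists
  # in low..high because the whole string is known not to fit.
  too_many = (low..high).bsearch { |n| str[0, n].bytesize > max_bytesize }
  str[0, too_many - 1]
end
```

For a 255-byte limit this bisects over the character counts 63 to 256 at most, regardless of how long the input is.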

Ruby: Limiting a UTF-8 string by byte length
