Question

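I'm working with UTF-8 strings. I need to take a slice using byte-based indexes rather than character-based ones.

I found references online to String#subseq, which is supposed to behave like String#[] but for bytes. Alas, it doesn't seem to have made it into 1.9.1.

Now, why would I want to do this? If I slice in the middle of a multibyte character, I may end up with an invalid string. That sounds like a terrible idea.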
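To make that risk concrete, here is a minimal sketch. The sample string and the variable names (raw, whole, half) are made up for illustration; only the force_encoding round trip comes from the question itself.

```ruby
# "héllo": in UTF-8 the é takes two bytes, at byte offsets 1 and 2.
s = "h\u00e9llo"

# Reinterpret a copy as raw bytes so that [] indexes by byte.
raw = s.dup.force_encoding("ASCII-8BIT")

whole = raw[1...3].force_encoding("UTF-8")  # both bytes of é
half  = raw[1...2].force_encoding("UTF-8")  # only the first byte of é

puts whole.valid_encoding?  # the slice ends on a character boundary
puts half.valid_encoding?   # the slice cuts the character in two
```

Checking valid_encoding? on the result is one way to detect a slice that landed inside a character.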

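Well, I'm using StringScanner, and it turns out its internal pointer is byte-based. I'm open to other options here.

This is what I'm using right now, but it's rather verbose: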

s.dup.force_encoding("ASCII-8BIT")[ix...pos].force_encoding("UTF-8")

Both ix and pos come from StringScanner, so they are byte-based.
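For context, a small self-contained sketch of how the mismatch shows up. The sample text and the regex are assumptions chosen for illustration; ix and pos play the same role as in the line above.

```ruby
require "strscan"

s = "h\u00e9llo w\u00f6rld"      # "héllo wörld"
scanner = StringScanner.new(s)

ix = scanner.pos                 # byte offset before the match
scanner.scan(/\S+/)              # consume the first word
pos = scanner.pos                # byte offset after the match

# Character-based slicing overshoots, because é is one character but two bytes.
by_char = s[ix...pos]

# Byte-based slicing via the force_encoding round trip returns the word itself.
word = s.dup.force_encoding("ASCII-8BIT")[ix...pos].force_encoding("UTF-8")

puts by_char.inspect
puts word.inspect
```

Here pos is larger than the number of characters consumed, which is why the plain s[ix...pos] picks up the trailing space.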

Answer 1

You could also do s.bytes.to_a[ix...pos].pack("C*").force_encoding("UTF-8"), but that looks even more esoteric.

If you call that line several times, a nicer approach is:

class String
  def byteslice(*args)
    self.dup.force_encoding("ASCII-8BIT").slice(*args).force_encoding("UTF-8")
  end
end

s.byteslice(ix...pos)

Answer 2

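Doesn't String#bytes do what you want? It returns an enumerator over the bytes in the string (as numbers, since they may not be valid characters, as you pointed out).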

str.bytes.to_a.slice(...)
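Since this yields an array of integers rather than a string, an extra step is needed to get text back. A minimal sketch, assuming the slice falls on character boundaries (the sample string and the names codes and piece are illustrative):

```ruby
str = "h\u00e9llo"                      # "héllo"

# Byte values at offsets 1 and 2, i.e. the two bytes of é.
codes = str.bytes.to_a.slice(1...3)

# Pack the integers back into a binary string, then relabel it as UTF-8.
piece = codes.pack("C*").force_encoding("UTF-8")

puts codes.inspect
puts piece
```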

Answer 3

Use this monkeypatch until String#byteslice() is added to Ruby 1.9.

class String
  unless method_defined? :byteslice
    ##
    # Does the same thing as String#slice but
    # operates on bytes instead of characters.
    #
    def byteslice(*args)
      unpack('C*').slice(*args).pack('C*')
    end
  end
end
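On a Ruby where String#byteslice is built in (1.9.3 and later), the method_defined? guard above leaves the native method in place. A short sketch of the native call forms, with a made-up sample string:

```ruby
s = "h\u00e9llo"                 # "héllo", 6 bytes in UTF-8

by_range  = s.byteslice(0...3)   # byte range: h plus both bytes of é
by_length = s.byteslice(1, 2)    # start offset and byte length: é
by_index  = s.byteslice(3)       # single offset: a one-byte string

puts by_range
puts by_length
puts by_index
puts by_range.encoding           # the receiver's encoding is kept
```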

ruby 1.9: how do I get a byte-index-based slice of a String?
