Question

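I'm trying to detect whether a file read in as a string is:

text (some kind of single-byte encoding), or
multibyte-encoded, binary, etc.

I have a "blacklist" array of characters/bytes that should not occur in "text":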

bad_bytes = [0, 1, 2, 3, 4, 5, 6, 11, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 127]

my_bytes = File.binread('some_file').bytes

I can think of two checks:

(my_bytes & bad_bytes).empty?
my_bytes == (my_bytes - bad_bytes)

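Both produce the correct result, and my gut feeling is that the latter might be a little faster. Or perhaps they are exactly the same? Either way, both seem rather inefficient for my purposes. I don't need the complete intersection, nor do I need every instance of the second array's elements removed from the first; finding a single element is enough.

Am I missing an existing method that does this? Is there a faster technique? If not, which of the two above is faster? Or am I approaching this all wrong?

Also, for bonus points: is there a mathematical / computer-science / fancy term for what I'm trying to do here?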

Answer 1

You can use a regular expression and String#[] to avoid converting to a byte array:

bad_bytes_pattern = /[#{ Regexp.escape(bad_bytes.map(&:chr).join) }]/n
#=> /[\x00\x01\x02\x03\x04\x05\x06\v\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1C\x1D\x1E\x1F\x7F]/

str = File.binread('some_file')

if str[bad_bytes_pattern]
  # contains bad bytes
else
  # ...
end

You can simplify the regular expression by using character ranges:

bad_bytes_pattern = /[\x00-\x06\x0B\x0E-\x1A\x1C-\x1F\x7F]/n
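
As a self-contained illustration of the pattern above, here is a minimal sketch that wraps the check in a method. The method name contains_bad_bytes? and the sample strings are made up for this example. String#match? (Ruby 2.4+) is used instead of String#[] only because it returns a plain true/false without building a match object, and String#b takes a binary copy of the string so that the result does not depend on the string's original encoding.

```ruby
# Same character-class pattern as above; the /n flag marks it as binary.
BAD_BYTES_PATTERN = /[\x00-\x06\x0B\x0E-\x1A\x1C-\x1F\x7F]/n

# Hypothetical helper: true if str contains at least one blacklisted byte.
def contains_bad_bytes?(str)
  str.b.match?(BAD_BYTES_PATTERN)
end

puts contains_bad_bytes?("plain text\twith a tab\r\n")  # false: tab, CR and LF are not blacklisted
puts contains_bad_bytes?("abc\x03def")                   # true: 0x03 is blacklisted
```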

Answer 2

You can use none? to check whether any of the bytes appears in the bad_bytes list:

my_bytes.none? { |b| bad_bytes.include? b }

The advantage here is that the loop stops as soon as the first byte matches the predicate, rather than traversing the whole array.

You can optimize further by putting bad_bytes into a Set (after require 'set'):

bad_bytes = Set[0, 1, 2, 3, 4, 5, 6, 11, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 127]
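
To make the early exit concrete, the following sketch counts how many bytes are inspected before any? returns. It uses any? rather than a negated none?, and String#each_byte rather than a byte array, so no intermediate array is built. The names BAD_BYTE_SET and bad_byte_in? are invented for this illustration.

```ruby
require 'set'

BAD_BYTE_SET = Set.new([*0..6, 11, *14..26, *28..31, 127])

# Hypothetical helper: true as soon as one blacklisted byte is seen.
def bad_byte_in?(str)
  str.each_byte.any? { |b| BAD_BYTE_SET.include?(b) }
end

# Count the bytes examined to show that iteration stops at the first hit.
inspected = 0
found = "ab\x01cdef".each_byte.any? do |b|
  inspected += 1
  BAD_BYTE_SET.include?(b)
end

puts found      # true
puts inspected  # 3: 'a', 'b', then the bad byte 0x01
```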

Answer 3

Gentlemen, start your engines!

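The following is a benchmark comparison of the three answers given so far. My main reason for doing it was to assess the relative efficiency of @Stefan's solution, which uses a regular expression. I was under the impression that regular expressions are generally relatively inefficient, but as you can see from the results below, that is certainly not the case here.

@Uri's and my solutions show how much is gained by converting the array of bad characters to a set and by reading the file character by character. My apologies, @Uri, if I haven't read the file into an array the way you intended.

I would like to see more benchmarks in SO answers. Benchmarking is neither difficult nor time-consuming, and it can provide useful insights. I find that most of the time goes into preparing the test cases. Note that I've put the methods to be tested in a module, so if another method is to be benchmarked I need only add it to the module; no other code has to change.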

Methods compared

module Methods
  require 'set'

  Bad_bytes_pattern = /[\x00-\x06\x0B\x0E-\x1A\x1C-\x1F\x7F]/n
  Bad_bytes = [*0..6, 11, *14..26, *28..31, 127]
  Bad_chars = Bad_bytes.map(&:chr)
  Bad_bytes_set = Set[*Bad_bytes]
  Bad_chars_set = Set[*Bad_chars]

  def stefan(fname)
    File.read(fname)[Bad_bytes_pattern]
  end

  def uri_with_array(fname)
    !File.read(fname).each_char.map(&:ord).none? { |b|
      Bad_bytes.include? b }
  end

  def uri_with_set(fname)
    !File.read(fname).each_char.map(&:ord).none? { |b|
      Bad_bytes_set.include? b }
  end

  def cary(fname)
    f = File.new fname
    f.each_char.any? { |c| Bad_chars_set.include?(c) }
  end
end

Include the module

include Methods
@methods = Methods.instance_methods(false)
  #=> [:stefan, :uri_with_array, :uri_with_set, :cary]

Create the test files

def make_test_files(prefix, nbr_files, file_size, prob_bad_byte)
  nbr_bad_bytes = Bad_bytes.size
  nbr_files.times.with_object([]) do |i, fnames|
    str = 'x'*file_size
    str[rand(file_size)] = Bad_chars[rand(nbr_bad_bytes)] if
      rand < prob_bad_byte
    fname = "#{prefix}.#{i}"
    File.write(fname, str)
    fnames << fname
  end
end

N = 50
M = 100_000
Prob_bad_byte = 0.5

@test_files = make_test_files('test', N, M, Prob_bad_byte)

Create a helper method

Call method m on every test file and return an array of true/false values, true if a bad byte was found in the given file:

def compute(m)
  @test_files.each_with_object([]) { |fname,arr|
    arr << (send(m, fname) ? true : false) }
end

Write the test header

puts "#{N} files of size #{M}.\n" +
  "Each file contains zero or one bad characters, the probability of the " +
  "latter being #{Prob_bad_byte}. If a bad character is present, it is at " +
  "a random location in the file.\n\n"

Confirm that all methods under test return the same values for the test data

unless @methods.map { |m| compute(m) }.uniq.size == 1
  print "Not all methods agree"
  exit
end

Write the benchmark

require 'benchmark'

@indent = @methods.map { |m| m.to_s.size }.max

Benchmark.bm(@indent) do |bm|
  @methods.each do |m|
    bm.report m.to_s do
      compute(m)
    end
  end
end

Clean up afterwards

@test_files.each { |fname| File.delete fname }

Results for hand-coded test parameters

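50 files of size 10000. Each file contains zero or one bad characters, the probability of the latter being 0.5. If a bad character is present, it is at a random location in the file.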

                                 user     system      total        real
stefan                       0.000000   0.000000   0.000000 (  0.003874)
uri_with_array               0.560000   0.000000   0.560000 (  0.565312)
uri_with_set                 0.170000   0.010000   0.180000 (  0.173694)
cary                         0.100000   0.000000   0.100000 (  0.100730)

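50 files of size 100000. Each file contains zero or one bad characters, the probability of the latter being 0.5. If a bad character is present, it is at a random location in the file.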

                                 user     system      total        real
stefan                       0.030000   0.000000   0.030000 (  0.027062)
uri_with_array               5.340000   0.040000   5.380000 (  5.387314)
uri_with_set                 1.640000   0.040000   1.680000 (  1.683844)
cary                         0.930000   0.010000   0.940000 (  0.929722)

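50 files of size 100000. Each file contains zero or one bad characters, the probability of the latter being 1.0. If a bad character is present, it is at a random location in the file.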

                                 user     system      total        real
stefan                       0.020000   0.010000   0.030000 (  0.022462)
uri_with_array               4.410000   0.030000   4.440000 (  4.447397)
uri_with_set                 1.520000   0.040000   1.560000 (  1.560788)
cary                         0.740000   0.010000   0.750000 (  0.747580)
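
The two array-based checks from the question were not part of the runs above. Following the same convention (each method takes a file name and returns a truthy value when a bad byte is present), they could be written roughly as below and added to the Methods module to be timed with the others. The module name ArrayMethods and the method names intersection_check and difference_check are made up here, and no timings are claimed for them.

```ruby
module ArrayMethods
  Bad_bytes = [*0..6, 11, *14..26, *28..31, 127]

  # (my_bytes & bad_bytes).empty? from the question, negated so that
  # true means "a bad byte was found".
  def intersection_check(fname)
    !(File.binread(fname).bytes & Bad_bytes).empty?
  end

  # my_bytes == (my_bytes - bad_bytes) from the question, negated likewise.
  def difference_check(fname)
    bytes = File.binread(fname).bytes
    bytes != (bytes - Bad_bytes)
  end
end
```

Both methods read the whole file and build a full byte array before comparing, which is exactly the work the question hoped to avoid.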

Answer 4

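I suggest doing two things to improve efficiency:

Read the file character by character (block by block behind the scenes) until a bad character is found or the whole file has been read without finding one.
Convert the array of bad bytes to a set of characters, for faster lookup.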

Code

require 'set'

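# NOTE: the file name 'test' is hard-coded; the `text` argument is not used.
def bad_byte?(text, bad_bytes)
  bb = Set.new(bad_bytes.map(&:chr))  # set of bad characters for fast lookup
  # The block form of File.open closes the file once the scan is done.
  File.open('test') { |f| f.each_char.any? { |c| bb.include?(c) } }
end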

Example

bad_bytes = [*0..6, 11, *14..26, *28..31, 127]
  #=> [ 0,  1,  2,  3,  4,  5,  6, 11, 14, 15, 16, 17,  18,
  #    19, 20, 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 127]

Read a 'good' test string from a file named `'test'`.

text = "Now is the time for all good people"
File.write('test', text)
bad_byte?(text, bad_bytes) #=> false

Read a 'bad' test string from a file named `'test'`.

text = "Now is the time " + 3.chr + "for all good people"
File.write('test', text)
bad_byte?(text, bad_bytes) #=> true
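
The method above always opens the file named 'test'. A variant that takes the file name as its argument might look like the sketch below. The name bad_byte_in_file? and the use of each_byte in binary mode (instead of each_char) are choices made for this illustration, not part of the original answer.

```ruby
require 'set'
require 'tempfile'

BAD_BYTES = Set.new([*0..6, 11, *14..26, *28..31, 127])

# Hypothetical variant: scans the named file byte by byte and stops at the
# first blacklisted byte. The block form of File.open closes the file.
def bad_byte_in_file?(fname, bad_bytes = BAD_BYTES)
  File.open(fname, 'rb') { |f| f.each_byte.any? { |b| bad_bytes.include?(b) } }
end

# Usage with a temporary file that is removed when the block ends.
Tempfile.create('sample') do |tmp|
  tmp.write("Now is the time " + 3.chr + "for all good people")
  tmp.flush
  puts bad_byte_in_file?(tmp.path)  # true
end
```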

Fastest way to find whether an array contains any member of another array?

4 answers: