Question

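I'm working with some large datasets and trying to improve performance. I need to determine whether an object is contained in an array. I was considering using either index or include?, so I benchmarked both.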

require 'benchmark'

a = (1..1_000_000).to_a
num = 100_000
reps = 100

Benchmark.bmbm do |bm|
  bm.report('include?') do
    reps.times { a.include? num }
  end
  bm.report('index') do
    reps.times { a.index num }
  end
end

Surprisingly (to me), index was much faster.

               user     system      total        real
include?   0.330000   0.000000   0.330000 (  0.334328)
index      0.040000   0.000000   0.040000 (  0.039812)

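Since index provides more information than include?, I expected it to be slightly slower if anything, but that was not the case. Why is it faster?

(I know that index comes directly from the Array class, whereas include? is inherited from Enumerable. Could that explain it?)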

Answer 1

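Looking at the Ruby MRI source, it seems that index uses the optimized rb_equal_opt while include? uses rb_equal. This can be seen in rb_ary_includes and rb_ary_index. Here is the commit that made the change. It's not clear to me why it is used in index and not in include?.

You might also find it interesting to read the discussion of this feature.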
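The parenthetical guess in the question (that include? is inherited from Enumerable) can be checked directly. This is a minimal sketch using only core reflection methods. On MRI, both lookups should report Array, which is consistent with rb_ary_includes and rb_ary_index both being Array-level C functions, so inheritance would not explain the difference.

```ruby
# Ask Ruby which class or module actually defines each method for Array.
include_owner = Array.instance_method(:include?).owner
index_owner   = Array.instance_method(:index).owner

puts "include? is defined on: #{include_owner}"
puts "index    is defined on: #{index_owner}"

# Enumerable also defines include?, but Array overrides it with its own version.
puts "Enumerable has its own include?: #{Enumerable.method_defined?(:include?)}"
```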

Answer 2

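If performance is your goal, you should use Array#bsearch, which performs a binary search over the array. Note that the array must already be sorted, as it is in the benchmark above.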

https://ruby-doc.org/core-2.7.0/Array.html#method-i-bsearch

a.bsearch { |x| num <=> x }

It beats both index and include?:

Rehearsal --------------------------------------------
include?   0.108172   0.000805   0.108977 (  0.112928)
index      0.122730   0.000502   0.123232 (  0.126323)
bsearch    0.000254   0.000027   0.000281 (  0.000354)
----------------------------------- total: 0.232490sec

               user     system      total        real
include?   0.106727   0.000036   0.106763 (  0.108495)
index      0.107732   0.000330   0.108062 (  0.110272)
bsearch    0.000201   0.000008   0.000209 (  0.000206)
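To make the trade-off concrete, here is a small sketch of a membership test built on bsearch in find-any mode. The helper name sorted_include? is hypothetical (it is not part of Ruby), and the sketch assumes an ascending array of mutually comparable, non-nil elements. On unsorted input the binary search can step past an element that is actually present, whereas include? and index scan linearly and do not care about order.

```ruby
# Hypothetical helper: membership test for an array already sorted ascending.
def sorted_include?(sorted, target)
  # Find-any mode: the block returns 0 on a match, a positive number when the
  # target lies to the right of x, and a negative number when it lies to the left.
  !sorted.bsearch { |x| target <=> x }.nil?
end

sorted   = [1, 3, 5, 7, 9, 11]
unsorted = [9, 1, 11, 5, 3, 7]

puts sorted_include?(sorted, 5)    # element present in sorted input
puts sorted_include?(sorted, 4)    # element absent from sorted input
puts unsorted.include?(9)          # linear scan works regardless of order
puts sorted_include?(unsorted, 9)  # unreliable: the input violates the sorted assumption
```

If the data is not already sorted, the cost of sorting it first has to be weighed against the number of lookups you plan to make.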

Answer 3

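I ran the same benchmark. include? seems to be faster than index, though not consistently. Here are my results for two different scenarios.

Ruby code identical to yours: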

               user     system      total        real
index      0.065803   0.000652   0.066455 (  0.067181)
include?   0.065551   0.000590   0.066141 (  0.066894)

Another benchmark:

                   user     system      total        real
    index      0.000034   0.000005   0.000039 (  0.000037)
    include?   0.000017   0.000001   0.000018 (  0.000017)

Code:

require 'benchmark'

# Parse ranks and return the number of reports, using index for the membership test
def solution_using_index(ranks)
  return 0 if ranks.nil? || ranks.empty? || ranks.length <= 1
  return ((ranks[0] - ranks[1] == 1) || (ranks[1] - ranks[0] == 1) ?  1 : 0) if ranks.length == 2
  return 0 if ranks.max > 1000000000 || ranks.min < 0

  grouped_ranks = ranks.group_by(&:itself)
  report_to, rank_keys= 0, grouped_ranks.keys
  rank_keys.each {|rank| report_to += grouped_ranks[rank].length if rank_keys.index(rank+1) }
  report_to
end

# Parse ranks and return the number of reports, using include? for the membership test
def solution_using_include(ranks)
  return 0 if ranks.nil? || ranks.empty? || ranks.length <= 1
  return ((ranks[0] - ranks[1] == 1) || (ranks[1] - ranks[0] == 1) ?  1 : 0) if ranks.length == 2
  return 0 if ranks.max > 1000000000 || ranks.min < 0

  grouped_ranks = ranks.group_by(&:itself)
  report_to, rank_keys= 0, grouped_ranks.keys
  rank_keys.each {|rank| report_to += grouped_ranks[rank].length if rank_keys.include?(rank+1) }
  report_to
end

test_data = [[3, 4, 3, 0, 2, 2, 3, 0, 0], [4, 4, 3, 3, 1, 0], [4, 2, 0] ]

Benchmark.bmbm do |bm|
  bm.report('index') do
    test_data.each do |ranks|
      reports_to = solution_using_index(ranks)
    end
  end
  bm.report('include?') do
    test_data.each do |ranks|
      reports_to = solution_using_include(ranks)
    end
  end
end
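Both solutions above already build a hash with group_by and then scan its keys array with index or include?. A possible variant, sketched here under the assumption that only the membership test matters, is to ask the hash itself with key?, which avoids the linear scan altogether. The method name solution_using_hash_lookup is hypothetical, and the range guard on ranks.max and ranks.min from the original is left out for brevity.

```ruby
# Hypothetical variant of the two solutions above: check rank + 1 with a hash
# lookup instead of scanning the array of keys.
def solution_using_hash_lookup(ranks)
  return 0 if ranks.nil? || ranks.length <= 1

  grouped_ranks = ranks.group_by(&:itself)
  # Add up the group sizes of every rank whose successor rank is also present.
  grouped_ranks.sum do |rank, members|
    grouped_ranks.key?(rank + 1) ? members.length : 0
  end
end

test_data = [[3, 4, 3, 0, 2, 2, 3, 0, 0], [4, 4, 3, 3, 1, 0], [4, 2, 0]]
test_data.each { |ranks| puts solution_using_hash_lookup(ranks) }
```

With inputs as small as test_data, any timing difference between index, include?, and key? is likely to be dominated by measurement noise, so larger inputs would be needed for a meaningful comparison.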

Why is array.index faster than array.include?
