I'm working with something that has a lot of duplicate rows:
puts xs
# => [ [1, "A", 23626], [1, "A", 31314], [2, "B", 2143], [2, "B", 5247] ]
puts xs.uniq{ |x| x[0] }.map{ |x| [x[0], x[1]] }
# => [ [1, "A"], [2, "B"] ]
But xs is huge. I tried to load it lazily, but Enumerator::Lazy has no uniq method.
How can I achieve this lazily?
Answer 0 (score: 7)
require 'set'

module EnumeratorLazyUniq
  refine Enumerator::Lazy do
    # Lazily filter the stream, keeping an element only if its key
    # (the block's return value, or the element itself when no block
    # is given) has not been seen before.
    def uniq
      set = Set.new
      select { |e|
        val = block_given? ? yield(e) : e
        !set.include?(val).tap { |exists|
          set << val unless exists
        }
      }
    end
  end
end
using EnumeratorLazyUniq
xs = [ [1, "A", 23626], [1, "A", 31314], [2, "B", 2143], [2, "B", 5247] ].to_enum.lazy
us = xs.uniq{ |x| x[0] }.map{ |x| [x[0], x[1]] }
puts us.to_a.inspect
# => [[1, "A"], [2, "B"]]
# Works with a block
puts us.class
# => Enumerator::Lazy
# Yep, still lazy.
ns = [1, 4, 6, 1, 2].to_enum.lazy
puts ns.uniq.to_a.inspect
# => [1, 4, 6, 2]
# Works without a block
This is implemented directly with a Set; this means that the uniq'd values (i.e. something like [1, "A"], rather than the stream elements themselves like [1, "A", 23626]) will take up memory.
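To see that the chain stays lazy end-to-end, here is a minimal sketch (assuming the using EnumeratorLazyUniq refinement above is in effect) that takes the first two unique pairs from an unbounded stream; first(2) stops pulling elements as soon as both keys have been seen:

# Hypothetical infinite stream of rows alternating between keys 1 and 2.
rows = Enumerator.new do |y|
  i = 0
  loop { y << [i % 2 + 1, i.even? ? "A" : "B", i]; i += 1 }
end
puts rows.lazy.uniq { |x| x[0] }.map { |x| [x[0], x[1]] }.first(2).inspect
# => [[1, "A"], [2, "B"]]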
Answer 1 (score: 2)
I decided to benchmark the two methods I suggested against @Amadan's method. The results speak for themselves.
Benchmark code
require 'benchmark'
require 'set'

module EnumeratorLazyUniq
  refine Enumerator::Lazy do
    def uniq
      set = Set.new
      select { |e|
        val = block_given? ? yield(e) : e
        !set.include?(val).tap { |exists|
          set << val unless exists
        }
      }
    end
  end
end
using EnumeratorLazyUniq

def amadan(xs)
  xs.uniq{ |x| x[0] }.map{ |x| [x[0], x[1]] }
end

def cary_set(arr)
  # Remember first-seen keys in a Set and collect [key, value] pairs.
  first = Set.new
  arr.each_with_object([]) do |(a0, a1, *_), b|
    unless first.include?(a0)
      first << a0
      b << [a0, a1]
    end
  end
end

def cary_hash(arr)
  # Keep the first pair per key in a Hash; insertion order is preserved,
  # so .values returns the pairs in first-seen order.
  arr.each_with_object({}) { |(a0, a1, *_), h|
    h[a0]=[a0, a1] unless h.key?(a0) }.values
end
Test data
n_uniq = 10_000
n_copies = 100
tot = n_uniq * n_copies
xs = tot.times.map { |i| [i % n_uniq, 0, 1] }
Run the benchmark
Benchmark.bm do |x|
x.report("cary_set ") { cary_set(xs) }
x.report("cary_hash") { cary_hash(xs) }
x.report("amadan ") { amadan(xs) }
end
Results
Unique elements: 200,000
Number of copies of each unique element: 5
Array size: 1,000,000
user system total real
cary_set 0.980000 0.030000 1.010000 ( 1.018618)
cary_hash 0.980000 0.010000 0.990000 ( 0.982508)
amadan 0.590000 0.010000 0.600000 ( 0.597249)
Unique elements: 100,000
Number of copies of each unique element: 10
Array size: 1,000,000
user system total real
cary_set 0.920000 0.030000 0.950000 ( 0.942539)
cary_hash 0.630000 0.020000 0.650000 ( 0.642367)
amadan 0.470000 0.000000 0.470000 ( 0.478658)
Unique elements: 50,000
Number of copies of each unique element: 20
Array size: 1,000,000
user system total real
cary_set 0.910000 0.020000 0.930000 ( 0.932277)
cary_hash 0.570000 0.000000 0.570000 ( 0.575439)
amadan 0.410000 0.010000 0.420000 ( 0.417695)
Unique elements: 1,000,000
Number of copies of each unique element: 10
Array size: 10,000,000
user system total real
cary_set 12.660000 0.270000 12.930000 ( 12.962183)
cary_hash 7.730000 0.060000 7.790000 ( 7.797486)
amadan 6.640000 0.060000 6.700000 ( 6.707706)
Answer 2 (score: 1)
Why not keep it simple?
Code
require 'set'
def extract(arr)
first = Set.new
arr.each_with_object([]) do |(a0, a1, *_), b|
unless first.include?(a0)
first << a0
b << [a0, a1]
end
end
end
Example
arr = [ [1, "A", 23626], [1, "A", 31314], [2, "B", 2143], [2, "B", 5247] ]
extract(arr)
#=> [[1, "A"], [2, "B"]]
Alternative
One variation is:
def extract(arr)
arr.each_with_object({}) { |(a0, a1, *_), h|
h[a0]=[a0, a1] unless h.key?(a0) }.values
end
I would expect the performance to be roughly the same, but the hash uses more memory, since values builds an additional array of the stored pairs.
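For illustration, a quick trace of this variant on the question's data, showing the intermediate hash before values strips the keys:

arr = [ [1, "A", 23626], [1, "A", 31314], [2, "B", 2143], [2, "B", 5247] ]
# Build the hash step: the first pair for each key wins.
h = arr.each_with_object({}) { |(a0, a1, *_), memo| memo[a0] = [a0, a1] unless memo.key?(a0) }
# => {1=>[1, "A"], 2=>[2, "B"]}   (Ruby hashes preserve insertion order)
h.values
# => [[1, "A"], [2, "B"]]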