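What is the function in Julia that returns the number of unique elements in an array?
In R you have length(unique(x)). I can do the same thing in Julia, but I think there should be a more efficient way.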
Answer 0 (score: 6)
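If you want an exact answer, length(unique(x)) is about as efficient as it gets for general objects. If your values have a limited domain, e.g. UInt8, a fixed-size table may be more efficient. If you can accept an approximation, you can use the HyperLogLog data structure/algorithm, which is implemented in the OnlineStats package: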
https://joshday.github.io/OnlineStats.jl/latest/api/#OnlineStats.HyperLogLog
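For the limited-domain case, here is a minimal sketch of the fixed-size table idea, assuming the input is a vector of UInt8. The name count_unique_uint8 is hypothetical and not part of Base or any package. It keeps one flag per possible value and bumps a counter the first time each value is seen.

```julia
# Hypothetical helper: counts distinct values in a vector of UInt8 using a
# 256-entry lookup table instead of a hash-based Set.
function count_unique_uint8(x::AbstractVector{UInt8})
    seen = falses(256)          # one flag per possible UInt8 value
    n = 0
    for v in x
        idx = Int(v) + 1        # shift 0..255 to 1..256 (Julia arrays are 1-based)
        if !seen[idx]
            seen[idx] = true
            n += 1
        end
    end
    return n
end

x = rand(UInt8, 1_000)
count_unique_uint8(x) == length(unique(x))   # both count the same distinct values
```

The table has a fixed size regardless of the input length, so no hashing or resizing is involved; whether it is actually faster for your data is worth measuring.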
Answer 1 (score: 4)
If you don't need the array x afterwards, length(unique!(x)) is slightly faster.
For Floats and Integers, if the array is already sorted, you can use a map-reduce.
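function count_unique_sorted(x)
    # Assumes x is sorted, so equal values are adjacent.
    isempty(x) && return 0
    f(a) = (a, 0)    # (value, number of value changes seen so far)
    function op(a, b)
        if a[1] == b[1]
            return (b[1], a[2])
        else
            return (b[1], a[2] + 1)
        end
    end
    # op is not associative, so fold strictly left to right with mapfoldl;
    # mapreduce may group the reduction differently on long arrays.
    return mapfoldl(f, op, x)[2] + 1
end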
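The same adjacent-comparison idea can also be written as an explicit loop. This is only a sketch under the same assumption (the input is already sorted); count_unique_sorted_loop is a hypothetical name introduced here for illustration.

```julia
# Hypothetical loop-based variant: counts distinct values in a sorted vector
# by counting the positions where the value differs from its predecessor.
function count_unique_sorted_loop(x::AbstractVector)
    isempty(x) && return 0
    n = 1                       # the first element always starts a new run
    prev = first(x)
    for v in Iterators.drop(x, 1)
        if v != prev
            n += 1
            prev = v
        end
    end
    return n
end

count_unique_sorted_loop([1, 1, 2, 3, 3])        # 3
count_unique_sorted_loop(sort([3, 1, 3, 2, 1]))  # 3
```

On unsorted input the function counts runs rather than distinct values, so the result can be too large.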
If you don't care about the order of the array x, you can sort and count in a single function:
count_unique_sorted!(x) = count_unique_sorted(sort!(x))
Some benchmarks:
using Random, StatsBase, BenchmarkTools
x = sample(1:100, 200)
length(unique(x)) == count_unique_sorted(sort(x))  # true
Using length(unique(x)):
@benchmark length(unique(x))
BenchmarkTools.Trial:
memory estimate: 6.08 KiB
allocs estimate: 17
--------------
minimum time: 3.350 μs (0.00% GC)
median time: 3.688 μs (0.00% GC)
mean time: 5.352 μs (24.35% GC)
maximum time: 6.691 ms (99.90% GC)
--------------
samples: 10000
evals/sample: 8
Using Set:
@benchmark length(Set(x))
BenchmarkTools.Trial:
memory estimate: 2.82 KiB
allocs estimate: 8
--------------
minimum time: 2.256 μs (0.00% GC)
median time: 2.467 μs (0.00% GC)
mean time: 3.654 μs (26.04% GC)
maximum time: 5.297 ms (99.91% GC)
--------------
samples: 10000
evals/sample: 9
Using count_unique_sorted!:
x2 = copy(x)
@benchmark count_unique_sorted!(x2)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 948.387 ns (0.00% GC)
median time: 990.323 ns (0.00% GC)
mean time: 1.038 μs (0.00% GC)
maximum time: 2.481 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 31
Using count_unique_sorted with an already sorted array:
x3 = sort(x)
@benchmark count_unique_sorted(x3)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 140.962 ns (0.00% GC)
median time: 146.831 ns (0.00% GC)
mean time: 154.121 ns (0.00% GC)
maximum time: 381.806 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 852
Using count_unique_sorted and sorting the array:
@benchmark count_unique_sorted(sort(x))
BenchmarkTools.Trial:
memory estimate: 1.77 KiB
allocs estimate: 1
--------------
minimum time: 1.470 μs (0.00% GC)
median time: 1.630 μs (0.00% GC)
mean time: 2.367 μs (21.82% GC)
maximum time: 4.880 ms (99.94% GC)
--------------
samples: 10000
evals/sample: 10
For strings, sorting and counting is slower than creating a Set.
Answer 2 (score: 4)
It appears that length(Set(x)) is somewhat faster than length(unique(x)).
julia> using StatsBase, BenchmarkTools
julia> num_unique(x) = length(Set(x));
julia> x = sample(1:100, 200);
julia> num_unique(x) == length(unique(x))
true
julia> @benchmark length(unique(x)) setup=(x = sample(1:10000, 20000))
BenchmarkTools.Trial:
memory estimate: 450.50 KiB
allocs estimate: 36
--------------
minimum time: 498.130 μs (0.00% GC)
median time: 570.588 μs (0.00% GC)
mean time: 579.011 μs (2.41% GC)
maximum time: 2.321 ms (63.03% GC)
--------------
samples: 5264
evals/sample: 1
julia> @benchmark num_unique(x) setup=(x = sample(1:10000, 20000))
BenchmarkTools.Trial:
memory estimate: 288.68 KiB
allocs estimate: 8
--------------
minimum time: 283.031 μs (0.00% GC)
median time: 393.317 μs (0.00% GC)
mean time: 397.878 μs (4.24% GC)
maximum time: 33.499 ms (98.80% GC)
--------------
samples: 6704
evals/sample: 1
Another benchmark, testing an array of strings:
julia> using Random
julia> @benchmark length(unique(x)) setup=(x = [randstring(3) for _ in 1:10000])
BenchmarkTools.Trial:
memory estimate: 450.50 KiB
allocs estimate: 36
--------------
minimum time: 818.024 μs (0.00% GC)
median time: 895.944 μs (0.00% GC)
mean time: 906.568 μs (1.61% GC)
maximum time: 1.964 ms (51.19% GC)
--------------
samples: 3049
evals/sample: 1
julia> @benchmark num_unique(x) setup=(x = [randstring(3) for _ in 1:10000])
BenchmarkTools.Trial:
memory estimate: 144.68 KiB
allocs estimate: 8
--------------
minimum time: 367.018 μs (0.00% GC)
median time: 378.666 μs (0.00% GC)
mean time: 384.486 μs (1.07% GC)
maximum time: 1.314 ms (70.80% GC)
--------------
samples: 4527
evals/sample: 1
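As a rough illustration of what length(Set(x)) computes, here is a sketch that builds the set by hand. The name num_unique_manual is hypothetical. It is not claimed to be faster than Set(x); it only shows that counting needs just the set of values seen, not the order-preserving array that unique returns.

```julia
# Hypothetical helper: counts distinct elements by inserting each one into a Set.
function num_unique_manual(x)
    seen = Set{eltype(x)}()
    for v in x
        push!(seen, v)          # inserting a value that is already present changes nothing
    end
    return length(seen)
end

words = ["ab", "cd", "ab"]
num_unique_manual(words) == length(Set(words))   # true
```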