Question

What is the best way to detect and remove duplicate rows from an array in Julia?

x = Integer.(round.(10 .* rand(1000,4)))

# In R I would apply the duplicated function.
x = x[duplicated(x),:]

Answer 1

unique is what you are looking for :) (This does not answer the detection part of the question.)

julia> x = Integer.(round.(10 .* rand(1000,4)))
1000×4 Array{Int64,2}:
 7  3  10   1
 7  4   8   9
 7  7   3   0
 3  4   8   2
 ⋮
julia> unique(x, 1)
973×4 Array{Int64,2}:
 7  3  10   1
 7  4   8   9
 7  7   3   0
 3  4   8   2
 ⋮

As for the detection part, a dirty fix would be to edit this line in the implementation of unique:

@nref $N A d->d == dim ? sort!(uniquerows) : (indices(A, d))

to:

(@nref $N A d->d == dim ? sort!(uniquerows) : (indices(A, d))), uniquerows

Alternatively, you could define your own unique2 with the above change:

using Base.Cartesian
import Base.Prehashed

@generated function unique2(A::AbstractArray{T,N}, dim::Int) where {T,N}
......
end

julia> y, idx = unique2(x, 1)

julia> y
960×4 Array{Int64,2}:
  8   3   1   5
  8   3   1   6
  1   1   0   1
  8  10   1  10
  9   1   8   7
  ⋮

julia> setdiff(1:1000, idx)
40-element Array{Int64,1}:
  99
 120
 132
 140
 216
 227
   ⋮

The benchmark on my machine is:

x = rand(1:10,1000,4) # 48 dups
@btime unique2($x, 1);
  124.342 μs (31 allocations: 145.97 KiB)
@btime duplicated($x);
  407.809 μs (9325 allocations: 394.78 KiB)

x = rand(1:4,1000,4) # 751 dups
@btime unique2($x, 1);
  66.062 μs (25 allocations: 50.30 KiB)
@btime duplicated($x);
  222.337 μs (4851 allocations: 237.88 KiB)

It turns out that the complicated metaprogramming, hash-table approach of unique2 benefits a lot from its lower memory allocation.

Answer 2

You could also go with:

duplicated(x) = foldl(
  (d,y)->(x[y,:] in d[1] ? (d[1],push!(d[2],y)) : (push!(d[1],x[y,:]),d[2])), 
  (Set(), Vector{Int}()), 
  1:size(x,1))[2]

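This collects a set of the rows seen so far and outputs the indices of rows that had already been seen. It is basically the minimum effort needed to get the result, so it should be fast.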

julia> x = rand(1:2,5,2)
5×2 Array{Int64,2}:
 2  1
 1  2
 1  2
 1  1
 1  1

julia> duplicated(x)
2-element Array{Int64,1}:
 3
 5

julia> x[duplicated(x),:]
2×2 Array{Int64,2}:
 1  2
 1  1
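
The same idea can be spelled out as an explicit loop, which may be easier to read and adapt. This is only a sketch: duplicated_rows is a hypothetical name, not a Base function, and it assumes a Julia 1.x session. The three-argument foldl call above appears to use pre-1.0 syntax; on Julia 1.x the initial value is passed with the init keyword instead.

```julia
# Hypothetical helper (not part of Base): return the indices of rows
# that repeat an earlier row of the matrix x.
function duplicated_rows(x::AbstractMatrix)
    seen = Set{Vector{eltype(x)}}()   # rows encountered so far
    dups = Int[]                      # indices of repeated rows
    for i in axes(x, 1)
        row = x[i, :]                 # copy of row i as a Vector
        if row in seen
            push!(dups, i)            # already seen: record as duplicate
        else
            push!(seen, row)          # first occurrence: remember it
        end
    end
    return dups
end

x = [2 1; 1 2; 1 2; 1 1; 1 1]
dups = duplicated_rows(x)                   # [3, 5]
deduped = x[setdiff(axes(x, 1), dups), :]   # keep first occurrences only
```

Note that x[dups, :] selects the repeated rows themselves, whereas indexing with the complement of dups, as in the last line, removes them.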

Answer 3

In Julia v1.4 and later, you need to write unique(a, dims=1), where a is your N×2 array:

julia> a=[2 2 ; 2 2; 1 2; 3 1]
4×2 Array{Int64,2}:
 2  2
 2  2
 1  2
 3  1

julia> unique(a,dims=1)
3×2 Array{Int64,2}:
 2  2
 1  2
 3  1
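
unique(a, dims=1) removes the duplicates but does not report which rows were dropped. One possible way to recover that, sketched here under the assumption of a Julia 1.x session, is to deduplicate the row indices by row content using the unique(f, itr) method; the variable names keep, dups and deduped are illustrative only.

```julia
a = [2 2; 2 2; 1 2; 3 1]

# Row indices of the first occurrence of each distinct row.
# unique(f, itr) keeps one element of itr per distinct value of f.
keep = unique(i -> a[i, :], axes(a, 1))

# Every remaining row index is a duplicate of an earlier row.
dups = setdiff(axes(a, 1), keep)

# For this input, the same rows that unique(a, dims=1) returns.
deduped = a[keep, :]
```

Here keep plays the role of idx in Answer 1, and dups corresponds to setdiff(1:1000, idx) there.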
