Question

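I have a large matrix data that I want to "organize" in a certain way. The matrix has 5 columns and about 2 million rows. The first 4 columns are characteristics of each observation (these are integers), and the last column is the outcome variable I am interested in (it contains real numbers). I want to organize this matrix into an array of arrays. Since data is very large, I am trying to parallelize this operation: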

addprocs(3)

@everywhere data = readcsv("datalocation", Int)

@everywhere const Z = 65
@everywhere const z = 16
@everywhere const Y = 16
@everywhere const y = 10

@everywhere const arr = Array{Vector}(Z-z+1,Y-y+1,Z-z+1,Y-y+1)

@parallel (vcat) for a1 in z:Z, e1 in y:Y, a2 in z:Z, e2 in y:Y
    arr[a1-z+1,e1-y+1,a2-z+1,e2-y+1] = data[(data[:,1].==a1) & (data[:,2].==e1) & (data[:,3].==a2) & (data[:,4].==e2), end]
end

But when I try to run the for loop, I get an error:

Error: syntax: invalid assignment location

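After the loop finishes, I would like arr to be available to all processors. What am I doing wrong?

EDIT: The input matrix data looks like this (the rows are in no particular order):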

16   10   16   10   100
16   10   16   11   200
20   12   21   13   500
16   10   16   10   300
20   12   21   13   500

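Note that some rows can be repeated, while others have the same "key" but a different fifth column.

The output I want looks like this (note how I use the dimensions of arr as the "keys" of a "dictionary"):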

arr[16-z+1, 10-y+1, 16-z+1, 10-y+1] = [100, 300]
arr[16-z+1, 10-y+1, 16-z+1, 11-y+1] = [200]
arr[20-z+1, 12-y+1, 21-z+1, 13-y+1] = [500, 500]

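That is, the element of arr at index (16-z+1, 10-y+1, 16-z+1, 10-y+1) is the vector [100, 300]. I do not care about the ordering of the rows or about the ordering within the vectors of last-column values.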
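
For illustration, the target structure described above can be built serially in a single pass over the rows. This is only a sketch on the five sample rows: it assumes Julia 1.x syntax (which differs from the older syntax in the code above), and the helper name group_rows is hypothetical.

```julia
# Bounds of the four key columns, as defined in the question.
const Z = 65
const z = 16
const Y = 16
const y = 10

# Sample input from the question (rows in no particular order).
data = [16 10 16 10 100;
        16 10 16 11 200;
        20 12 21 13 500;
        16 10 16 10 300;
        20 12 21 13 500]

# Hypothetical helper: visit each row once and append its outcome
# (last column) to the vector stored at that row's shifted key.
function group_rows(data)
    # One empty vector per possible key combination.
    arr = [eltype(data)[] for a1 in z:Z, e1 in y:Y, a2 in z:Z, e2 in y:Y]
    for i in axes(data, 1)
        a1, e1, a2, e2 = data[i, 1], data[i, 2], data[i, 3], data[i, 4]
        push!(arr[a1-z+1, e1-y+1, a2-z+1, e2-y+1], data[i, end])
    end
    return arr
end

arr = group_rows(data)
```

Because every row is touched once, this avoids rescanning all of data for each of the key combinations, which is what the boolean-mask approach does.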

Answer 1

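Note: I initially misunderstood your question. I thought you were trying to split the data across your workers, but I now see that this is not what you are after. I had written up some simple examples of ways to accomplish that. I will leave them here as a response in case someone finds them useful in the future.

To get started: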

writedlm("path/to/data.csv", rand(100,10), ',')
addprocs(4)

Option 1:

function sendto(p::Int; args...)
    for (nm, val) in args
        @spawnat(p, eval(Main, Expr(:(=), nm, val)))
    end
end

Data = readcsv("path/to/data.csv")

for (idx, pid) in enumerate(workers())
    Start = (idx-1)*25 + 1
    End = Start + 24
    sendto(pid, Data = Data[Start:End,:])
end

Option 2:

@everywhere begin
    if myid() != 1
        Start = (myid()-2)*25 + 1
        End = Start + 24
        println(Start)
        println(End)
        Data = readcsv("path/to/data.csv")[Start:End,:]
    end
end

# verify everything looks right for what got sent
@everywhere if myid()!= 1 println(typeof(Data)) end
@everywhere if myid()!= 1 println(size(Data)) end

Option 3:

for (idx, pid) in enumerate(workers())
    Start = (idx-1)*25 + 1
    End = Start + 24
    sendto(pid, Start = Start, End = End)
end

@everywhere if myid()!= 1 Data = readcsv("path/to/data.csv")[Start:End,:] end
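
The options above hard-code 25 rows per worker, which fits the 100-row example file and four workers. A minimal sketch of computing the row ranges for an arbitrary number of rows and workers follows; chunk_ranges is a hypothetical helper, not part of Julia or of the code above.

```julia
# Hypothetical helper: split rows 1:nrows into nchunks contiguous ranges
# whose lengths differ by at most one row.
function chunk_ranges(nrows::Int, nchunks::Int)
    base, extra = divrem(nrows, nchunks)
    ranges = UnitRange{Int}[]
    start = 1
    for k in 1:nchunks
        # The first `extra` chunks each take one additional row.
        len = base + (k <= extra ? 1 : 0)
        push!(ranges, start:(start + len - 1))
        start += len
    end
    return ranges
end

ranges = chunk_ranges(100, 4)
```

The k-th range could then be used for the k-th entry of workers() in place of the fixed Start and End values.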

Answer 2

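Does this work for you? I tried to simulate your data by repeating the snippet you gave 1000 times. It is not as elegant as I would like. In particular, I could not get remotecall_fetch() to work the way I wanted (even when wrapping it in @async), so I had to split the call and the fetch into two steps. Let me know how this looks.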

addprocs(n)

@everywhere begin
    if myid() != 1
        multiplier = 10^3;
        Data = readdlm("/path/to/Input.txt")
        global data = kron(Data,ones(multiplier));
        println(size(data))
    end
end

@everywhere begin
    function Select_Data(a1, e1, a2, e2, data=data)
        return data[(data[:,1].==a1) & (data[:,2].==e1) & (data[:,3].==a2) & (data[:,4].==e2), end]
    end
end

n_workers = nworkers()
function next_pid(pid, n_workers)
    if pid <= n_workers
        return pid + 1
    else
        return 2
    end
end

const arr = Array{Any}(Z-z+1,Y-y+1,Z-z+1,Y-y+1);
println("Beginning Processing Work")
@sync begin
    pid = 2
    for a1 in z:Z, e1 in y:Y, a2 in z:Z, e2 in y:Y
        pid = next_pid(pid, n_workers)
        arr[a1-z+1,e1-y+1,a2-z+1,e2-y+1] = remotecall(pid, Select_Data, a1, e1, a2, e2)
    end
end
println("Retrieving Completed Jobs")
@sync begin
    pid = 2
    for a1 in z:Z, e1 in y:Y, a2 in z:Z, e2 in y:Y
        arr[a1-z+1,e1-y+1,a2-z+1,e2-y+1] = fetch(arr[a1-z+1,e1-y+1,a2-z+1,e2-y+1])
    end
end
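
The two-step pattern used above (submit every job with remotecall, then fetch the results) can be sketched on the question's five sample rows. This is a sketch under assumptions, not a drop-in replacement: it assumes Julia 1.x, where the function comes first in remotecall(f, pid, args...) and the elementwise "and" is written .&, unlike the remotecall(pid, f, ...) and & forms used above. The worker count of 2 and the names select_data, keys_to_query, and pool are illustrative.

```julia
using Distributed
addprocs(2)   # assumed worker count for this sketch

# Define the sample data and the selection function on every process.
@everywhere begin
    const data = [16 10 16 10 100;
                  16 10 16 11 200;
                  20 12 21 13 500;
                  16 10 16 10 300;
                  20 12 21 13 500]

    # Return the last-column values of all rows matching the four keys.
    function select_data(a1, e1, a2, e2)
        mask = (data[:, 1] .== a1) .& (data[:, 2] .== e1) .&
               (data[:, 3] .== a2) .& (data[:, 4] .== e2)
        return data[mask, end]
    end
end

keys_to_query = [(16, 10, 16, 10), (16, 10, 16, 11), (20, 12, 21, 13)]
pool = workers()

# Step 1: submit every job round-robin without waiting; each call returns a Future.
futures = [remotecall(select_data, pool[mod1(i, length(pool))], k...)
           for (i, k) in enumerate(keys_to_query)]

# Step 2: collect the results once all jobs have been submitted.
results = fetch.(futures)
```

Submitting all jobs before fetching any lets the workers run concurrently, whereas fetching inside the submission loop would wait for each job in turn.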

Parallelizing data processing
