Question

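I am trying to speed up a loop in which successive DataFrames are joined to the first one, using the first column as the key. The DataFrames are generated by a function, my_function. The first column is named :REF. Later DataFrames may be shorter than the first, so I cannot assign directly to a DataFrame column the way I would in pandas.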

base_df = my_function(elem1)

for elem in elems[2:end]
    tmp = my_function(elem)
    base_df = join(base_df, tmp, on=:REF, kind=:left)
end

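Is there a way to merge a list of DataFrames into one? Thanks,

PS: The DataFrames contain columns of different types: String, Int, Float64.

Update. Here are some example DataFrames: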

df1 = DataFrame(REF = 1:5, D1=rand(5))
df2 = DataFrame(REF = 1:3, D1=rand(3))
df3 = DataFrame(REF = 1:4, D1=rand(4))

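I am looking for a way to combine these three (or more) into a single DataFrame. Note the different numbers of rows.

Update 2. Sorry, df1, df2 and df3 should have different columns (D1, D2 and D3). Here is the correct setup of the DataFrames: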

df1 = DataFrame(REF = 1:5, D1=rand(5))
df2 = DataFrame(REF = 1:3, D2=rand(3))
df3 = DataFrame(REF = 1:4, D3=rand(4))

Answer 1

To set up the answer, assume:

df1 = DataFrame(REF = 1:5, D1=rand(5))
df2 = DataFrame(REF = 1:3, D1=rand(3))
df3 = DataFrame(REF = 1:4, D1=rand(4))

elems = [df1, df2, df3]
my_function = identity

Now the code that generates the large DataFrame:

dfs = my_function.(elems)
base_df = DataFrame(Dict([f=>vcat(getindex.(dfs,f)...) for f in names(dfs[1])]...))

This gives something like:

12×2 DataFrames.DataFrame
│ Row │ D1         │ REF │
├─────┼────────────┼─────┤
│ 1   │ 0.664144   │ 1   │
│ 2   │ 0.119155   │ 2   │
│ 3   │ 0.471053   │ 3   │
│ 4   │ 0.547811   │ 4   │
│ 5   │ 0.600263   │ 5   │
│ 6   │ 0.21306    │ 1   │
│ 7   │ 0.985412   │ 2   │
│ 8   │ 0.886738   │ 3   │
│ 9   │ 0.00926173 │ 1   │
│ 10  │ 0.701962   │ 2   │
│ 11  │ 0.328322   │ 3   │
│ 12  │ 0.753062   │ 4   │

This approach reduces memory use from quadratic to linear (and performance improves as memory use drops).
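For readers on a newer DataFrames.jl, the same row-wise stacking can probably be written more directly. The following is only a sketch, assuming DataFrames.jl 1.x (a newer API than the one used in this answer); the name `stacked` is introduced here for illustration.

```julia
using DataFrames

df1 = DataFrame(REF = 1:5, D1 = rand(5))
df2 = DataFrame(REF = 1:3, D1 = rand(3))
df3 = DataFrame(REF = 1:4, D1 = rand(4))
dfs = [df1, df2, df3]

# Stack all frames row-wise in a single call.
# Assumption: every frame has the same set of column names.
stacked = reduce(vcat, dfs)

# The result has one row per input row: 5 + 3 + 4 = 12.
println(size(stacked))
```

Stacking the whole list in one call avoids building a fresh intermediate DataFrame at every loop iteration, which is where the repeated copying in the original loop comes from.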

Update

As new details came to light (and my understanding of the question improved), here is code that better generates the desired base_df:

df1 = DataFrame(REF = 1:5, D1=rand(5))
df2 = DataFrame(REF = 1:3, D2=rand(3))
df3 = DataFrame(REF = 1:4, D3=rand(4))
elems = [df1, df2, df3]

cols = [(i,f) for (i,t) in enumerate(elems) for f in names(t) if !(f == :REF)]
rows = union(getindex.(elems,:REF)...)
ref2row = Dict(v=>i for (i,v) in enumerate(rows))

pre_df = Dict{Symbol,DataVector{Any}}([c[2]=>DataArray(eltype(elems[c[1]][c[2]]),
 length(rows)) for c in cols])

foreach(tpl -> pre_df[tpl[3][1]][ref2row[tpl[2]]] = tpl[3][2],
 [(i,r[:REF],v) 
  for (i,t) in enumerate(elems) 
  for r in eachrow(t) 
  for v in r if v[1] != :REF
 ])

# row i of the result holds the key rows[i]; looking up ref2row[i] only works when the keys happen to be 1:n
pre_df[:REF] = rows

base_df = DataFrame(pre_df)

which gives:

5×4 DataFrames.DataFrame
│ Row │ D1       │ D2       │ D3        │ REF │
├─────┼──────────┼──────────┼───────────┼─────┤
│ 1   │ 0.93479  │ 0.582954 │ 0.133983  │ 1   │
│ 2   │ 0.472456 │ 0.992173 │ 0.32442   │ 2   │
│ 3   │ 0.365478 │ 0.117772 │ 0.62522   │ 3   │
│ 4   │ 0.976192 │ NA       │ 0.0861988 │ 4   │
│ 5   │ 0.76358  │ NA       │ NA        │ 5   │
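The table above keeps every :REF value that occurs in any input, which is what an outer join on :REF produces. As a hedged sketch, assuming DataFrames.jl 1.x (not the version this answer was written against), the same shape can be obtained by folding the list with `outerjoin`; the name `wide` is introduced here for illustration, and gaps show up as `missing` rather than `NA`.

```julia
using DataFrames

df1 = DataFrame(REF = 1:5, D1 = rand(5))
df2 = DataFrame(REF = 1:3, D2 = rand(3))
df3 = DataFrame(REF = 1:4, D3 = rand(4))
elems = [df1, df2, df3]

# Fold the list with pairwise outer joins on the key column.
# Assumption: non-key column names are unique across the frames.
wide = reduce((a, b) -> outerjoin(a, b; on = :REF), elems)

# Row order after a join is not relied on here, so sort by the key explicitly.
sort!(wide, :REF)

println(size(wide))
```

This fold still creates a new DataFrame at each step, so it is shorter to write but not necessarily faster than the hand-written version above.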

Answer 2

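Here is an alternative approach, assuming you need a left join (as in your question; if you need a different kind of join, it should be easy to adapt). The difference from Dan Getz's solution is that it does not use DataVector but operates on arrays that allow missing (you can check the difference by running showcols on the resulting DataFrame). The benefit is that working with such data later is more efficient, because the column types are known: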

function joiner(ref_left, ref_right, val_right)
    x = DataFrames.similar_missing(val_right, length(ref_left))
    j = 1
    for i in 1:length(ref_left)
        while ref_left[i] > ref_right[j]
            j += 1
            j > length(ref_right) && return x
        end
        if ref_left[i] == ref_right[j]
            x[i] = val_right[j]
        end
    end
    return x
end

function left_join_sorted(elems::Vector{DataFrame}, on::Symbol)
    # we perform left join to base_df
    # the columns of elems[1] will be reused, use deepcopy if you want fresh columns
    base_df = copy(elems[1])
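    # use the key column passed in `on` rather than a hard-coded :REF
    # joiner assumes both key columns are sorted in ascending order
    ref_left = base_df[on]
    for i in 2:length(elems)
        df = elems[i]
        ref_right = df[on]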
        for n in names(df)
            if n != on
                # this assumes that column names in all data frames except on are unique, otherwise they will be overwritten
                # we perform left join to the first DataFrame in elems
                base_df[n] = joiner(ref_left, ref_right, df[n])
            end
        end
    end
    base_df
end

Here is a usage example:

julia> left_join_sorted([df1, df2, df3], :REF)
5×4 DataFrames.DataFrame
│ Row │ REF │ D1       │ D2        │ D3       │
├─────┼─────┼──────────┼───────────┼──────────┤
│ 1   │ 1   │ 0.133361 │ 0.179822  │ 0.200842 │
│ 2   │ 2   │ 0.548581 │ 0.836018  │ 0.906814 │
│ 3   │ 3   │ 0.304062 │ 0.0797432 │ 0.946639 │
│ 4   │ 4   │ 0.755515 │ missing   │ 0.519437 │
│ 5   │ 5   │ 0.571302 │ missing   │ missing  │

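As a side benefit, my benchmarks show that this is about 20 times faster than using DataVector (if you want a further speedup you can use @inbounds, but the possible gain is probably not worth the risk).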
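The merge scan inside joiner can be tried on plain vectors without any DataFrame. The sketch below is a hypothetical stand-alone variant, not the function above: `joiner_sketch` is a name introduced here, it allocates the result with Base's `Vector{Union{Missing, T}}` instead of `DataFrames.similar_missing`, and it assumes both key vectors are sorted in ascending order.

```julia
# Hypothetical stand-alone variant of the sorted left-join scan.
# Assumption: ref_left and ref_right are both sorted in ascending order.
function joiner_sketch(ref_left, ref_right, val_right::AbstractVector{T}) where {T}
    # start with all-missing output, one slot per left key
    x = Vector{Union{Missing, T}}(missing, length(ref_left))
    j = 1
    n = length(ref_right)
    for i in eachindex(ref_left)
        # advance the right cursor past keys smaller than the current left key
        while j <= n && ref_right[j] < ref_left[i]
            j += 1
        end
        # right side exhausted: the remaining slots stay missing
        j > n && break
        if ref_right[j] == ref_left[i]
            x[i] = val_right[j]
        end
    end
    return x
end

x = joiner_sketch(1:5, 1:3, [0.1, 0.2, 0.3])
println(x)          # left keys 4 and 5 have no match and stay missing
println(eltype(x))  # Union{Missing, Float64}
```

Because each cursor only moves forward, the scan visits every element of both key vectors at most once, and the concrete element type of the result is what lets later code work with the column efficiently.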

Edit: fixed the condition in the joiner loop.

Joining a list of DataFrames in Julia
