In particular, I want something like R's data.table
d[, function(...), by = key]
Using an answer from another Stack Overflow question (Julia Dataframe group by and pivot tables functions), I have this solution:
using DataFrames
df = DataFrame(Location = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"],
               Class = ["H", "L", "H", "L", "L", "H", "H", "L", "L", "M"],
               Address = ["12 Silver", "10 Fak", "12 Silver", "1 North", "10 Fak", "2 Fake", "1 Red", "1 Dog", "2 Fake", "1 White"],
               Score = ["4", "5", "3", "2", "1", "5", "4", "3", "2", "1"])
julia> by(df, :Location, d -> DataFrame(count=nrow(d)))
4x2 DataFrames.DataFrame
| Row | Location | count |
|-----|----------|-------|
| 1 | "DC" | 1 |
| 2 | "NY" | 3 |
| 3 | "SF" | 3 |
| 4 | "TX" | 3 |
This works, but it is very slow on large datasets. Is there a faster solution?
Answer 0 (score: 2)
For counting, the following solution is faster but less readable:
using StatsBase  # provides countmap
cmap = countmap(df[:Location]);
res = DataFrame(Location=collect(keys(cmap)), count=collect(values(cmap)))
Or, more generally (again for counting):
countdf(df::DataFrame, fld) =
( h = countmap(df[fld]) ; DataFrame(collect.([keys(h),values(h)]),[fld,:count]) )
which gives:
julia> countdf(df,:Location)
4×2 DataFrames.DataFrame
│ Row │ Location │ count │
├─────┼──────────┼───────┤
│ 1 │ "DC" │ 1 │
│ 2 │ "SF" │ 3 │
│ 3 │ "NY" │ 3 │
│ 4 │ "TX" │ 3 │
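For readers without StatsBase, the same single-pass count can be sketched with a plain Dict from Base Julia. This is only an illustration of what countmap does conceptually, not its actual implementation; `countby` is a hypothetical helper name, and the key column is passed as an ordinary vector.

```julia
# Count occurrences of each key in a single pass, using only Base Julia.
# `countby` is a hypothetical helper name, not part of DataFrames or StatsBase.
function countby(keys_col::AbstractVector{K}) where {K}
    counts = Dict{K,Int}()
    for k in keys_col
        counts[k] = get(counts, k, 0) + 1   # unseen keys start from 0
    end
    return counts
end

# Same Location column as in the question.
locations = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"]
counts = countby(locations)
```

In recent DataFrames.jl releases, where `by` has been removed, the grouped count is usually written as `combine(groupby(df, :Location), nrow => :count)`; whether that is fast enough for a given dataset is worth measuring rather than assuming.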
For other aggregate functions (ones that can be computed sequentially, one row at a time), we can define these functions:
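# foldmap: fold `op` over the rows of df, keeping one accumulator per
# distinct value of `col`; each accumulator starts from v0.
foldmap(op, v0, df, col) =
foldl((x,y)->setindex!(x,op(get(x,y[col],v0),y),y[col]),
Dict{eltype(df[col]),typeof(v0)}(), eachrow(df))
# folddf: same as foldmap, but returns a DataFrame with columns `col` and :res.
folddf(op, v0, df, col) =
(h = foldmap(op, v0, df, col) ;
DataFrame(collect.([keys(h),values(h)]),[col,:res]) )
# Operators: x is the group's accumulator, y is the current row.
inc1(x,y) = x+1
sumScore(x,y) = x+y[:Score]
maxScore(x,y) = max(x,y[:Score])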
With these definitions:
julia> eltype(df[:Score])<:Real || ( df[:Score] = parse.(Float64, df[:Score]) );
julia> foldmap(inc1, 0, df, :Location)
Dict{String,Int64} with 4 entries:
"DC" => 1
"SF" => 3
"NY" => 3
"TX" => 3
julia> folddf(sumScore, 0.0, df, :Location)
4×2 DataFrames.DataFrame
│ Row │ Location │ res │
├─────┼──────────┼──────┤
│ 1 │ "DC" │ 1.0 │
│ 2 │ "SF" │ 11.0 │
│ 3 │ "NY" │ 9.0 │
│ 4 │ "TX" │ 9.0 │
julia> folddf(maxScore, 0.0, df, :Location)
4×2 DataFrames.DataFrame
│ Row │ Location │ res │
├─────┼──────────┼─────┤
│ 1 │ "DC" │ 1.0 │
│ 2 │ "SF" │ 5.0 │
│ 3 │ "NY" │ 4.0 │
│ 4 │ "TX" │ 4.0 │
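The definitions above use the older three-argument `foldl(op, v0, itr)` form; in Julia 1.x the initial value is passed as the `init` keyword instead. As a rough sketch of the same grouped-fold idea for Julia 1.x, the version below uses only Base and takes the key and value columns as plain vectors rather than iterating over DataFrame rows. `groupfold` is a hypothetical name chosen for this example, and the scores are assumed to be already parsed to Float64.

```julia
# Grouped left fold over two parallel column vectors, using only Base Julia.
# `groupfold` is a hypothetical name chosen for this sketch.
function groupfold(op, v0::V, keys_col::AbstractVector{K},
                   vals_col::AbstractVector) where {K,V}
    acc = Dict{K,V}()
    for (k, v) in zip(keys_col, vals_col)
        acc[k] = op(get(acc, k, v0), v)   # each group starts from v0
    end
    return acc
end

# Same Location and Score columns as in the question (scores already numeric).
locations = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"]
scores = [4.0, 5.0, 3.0, 2.0, 1.0, 5.0, 4.0, 3.0, 2.0, 1.0]

sums = groupfold(+, 0.0, locations, scores)     # per-location sum
maxs = groupfold(max, 0.0, locations, scores)   # per-location maximum
```

Note that starting the maximum from 0.0, as in the original example, is only correct when all scores are non-negative; `-Inf` would be a safer initial value in general.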