How to do a fast group-by operation in Julia?

Time: 2017-11-21 18:38:36

Tags: julia

In particular, I want something like R's data.table d[, function(...), by = key]. Using an answer to another Stack Overflow question (Julia Dataframe group by and pivot tables functions), I have this solution:

using DataFrames

df = DataFrame(Location = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"],
               Class = ["H", "L", "H", "L", "L", "H", "H", "L", "L", "M"],
               Address = ["12 Silver", "10 Fak", "12 Silver", "1 North", "10 Fak", "2 Fake", "1 Red", "1 Dog", "2 Fake", "1 White"],
               Score = ["4", "5", "3", "2", "1", "5", "4", "3", "2", "1"])


julia> by(df, :Location, d -> DataFrame(count=nrow(d)))
4x2 DataFrames.DataFrame
| Row | Location | count |
|-----|----------|-------|
| 1   | "DC"     | 1     |
| 2   | "NY"     | 3     |
| 3   | "SF"     | 3     |
| 4   | "TX"     | 3     |

This works fine, but it turns out to be very slow for large datasets. Is there a faster solution?

1 Answer:

Answer 0 (score: 2)

For counting, the following solution is faster but less readable:

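using StatsBase  # countmap comes from StatsBase; load it explicitly if your DataFrames version does not re-export it

cmap = countmap(df[:Location])
res = DataFrame(Location=collect(keys(cmap)), count=collect(values(cmap)))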

Or, more generally (again for counting):

countdf(df::DataFrame, fld) = 
  ( h = countmap(df[fld]) ; DataFrame(collect.([keys(h),values(h)]),[fld,:count]) )

which gives:

julia> countdf(df,:Location)
4×2 DataFrames.DataFrame
│ Row │ Location │ count │
├─────┼──────────┼───────┤
│ 1   │ "DC"     │ 1     │
│ 2   │ "SF"     │ 3     │
│ 3   │ "NY"     │ 3     │
│ 4   │ "TX"     │ 3     │
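The counting approach above is presumably faster because it makes a single pass over the column and updates one hash table, instead of building a small DataFrame for every group. A minimal Base-only sketch of that idea follows; `count_by_key` is a hypothetical helper name used for illustration, not a function from DataFrames or StatsBase, and the input is the plain `Location` vector from the question.

```julia
# Count how often each key occurs, in one pass over the column.
# `count_by_key` is a hypothetical name, not a DataFrames/StatsBase function.
function count_by_key(keys_col::AbstractVector{K}) where {K}
    counts = Dict{K,Int}()
    for k in keys_col
        # Unseen keys start at 0, then every occurrence adds 1.
        counts[k] = get(counts, k, 0) + 1
    end
    return counts
end

location = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"]
counts = count_by_key(location)
```

As with `countmap`, the result is a `Dict`, so the iteration order of the keys is not the order of first appearance and is not sorted.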

For other aggregation functions (ones that can be computed sequentially), we can define these functions:

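# Fold `op` over the rows of `df`, keeping one accumulator per distinct value of `col`.
# `v0` is the starting accumulator for a key that has not been seen yet.
# Note: this uses the older positional form foldl(op, v0, itr); Julia 1.x takes `init=v0` instead.
foldmap(op, v0, df, col) = 
  foldl((x,y)->setindex!(x,op(get(x,y[col],v0),y),y[col]),
  Dict{eltype(df[col]),typeof(v0)}(), eachrow(df))
# Same as foldmap, but return the result as a two-column DataFrame.
folddf(op, v0, df, col) = 
  (h = foldmap(op, v0, df, col) ; 
   DataFrame(collect.([keys(h),values(h)]),[col,:res]) )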

inc1(x,y) = x+1
sumScore(x,y) = x+y[:Score]
maxScore(x,y) = max(x,y[:Score])

With these definitions:

julia> eltype(df[:Score])<:Real || ( df[:Score] = parse.(Float64, df[:Score]) );

julia> foldmap(inc1, 0, df, :Location)
Dict{String,Int64} with 4 entries:
  "DC" => 1
  "SF" => 3
  "NY" => 3
  "TX" => 3

julia> folddf(sumScore, 0.0, df, :Location)
4×2 DataFrames.DataFrame
│ Row │ Location │ res  │
├─────┼──────────┼──────┤
│ 1   │ "DC"     │ 1.0  │
│ 2   │ "SF"     │ 11.0 │
│ 3   │ "NY"     │ 9.0  │
│ 4   │ "TX"     │ 9.0  │

julia> folddf(maxScore, 0.0, df, :Location)
4×2 DataFrames.DataFrame
│ Row │ Location │ res │
├─────┼──────────┼─────┤
│ 1   │ "DC"     │ 1.0 │
│ 2   │ "SF"     │ 5.0 │
│ 3   │ "NY"     │ 4.0 │
│ 4   │ "TX"     │ 4.0 │
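The fold used in this answer can also be written as an explicit loop over plain vectors, which may make the per-key accumulator easier to see. This is a minimal sketch under stated assumptions, not the answer's own code: `fold_by_key` is a hypothetical helper, the scores are assumed to be already parsed to `Float64`, and `-Inf` is used as the starting value for `max` so that negative scores would also be handled (the `0.0` used above is fine for these non-negative scores).

```julia
# Hypothetical helper (not part of DataFrames or StatsBase): fold `op` over
# (key, value) pairs, keeping one accumulator per distinct key.
function fold_by_key(op, v0, keys_col, vals_col)
    acc = Dict{eltype(keys_col),typeof(v0)}()
    for (k, v) in zip(keys_col, vals_col)
        # Start from v0 the first time a key is seen, then keep folding.
        acc[k] = op(get(acc, k, v0), v)
    end
    return acc
end

location = ["NY", "SF", "NY", "NY", "SF", "SF", "TX", "TX", "TX", "DC"]
score = [4.0, 5.0, 3.0, 2.0, 1.0, 5.0, 4.0, 3.0, 2.0, 1.0]

sums  = fold_by_key(+, 0.0, location, score)
maxes = fold_by_key(max, -Inf, location, score)

# A mean is not a single fold over values, but a running (sum, count) pair is.
sum_count = fold_by_key((a, v) -> (a[1] + v, a[2] + 1), (0.0, 0), location, score)
means = Dict(k => s / n for (k, (s, n)) in sum_count)
```

Any aggregate that can be updated one row at a time fits this pattern; aggregates that need the whole group at once (a median, for example) do not.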