Deedle equivalent of pandas.merge

Date: 2017-05-05 17:07:58

Tags: f# f#-data deedle fslab

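I am looking to merge two Deedle (F#) frames based on a specific column in each frame, in a similar manner to pandas.DataFrame.merge. The perfect example would be a primary frame that contains columns of data plus a (city, state) column, along with an information frame that contains the following columns: (city, state); lat; long. If I want to add the lat and long columns to my primary frame, I would merge the two frames on the (city, state) column.

Here is an example: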

    open Deedle

    let primaryFrame =
            [(0, "Job Name", box "Job 1")
             (0, "City, State", box "Reno, NV")
             (1, "Job Name", box "Job 2")
             (1, "City, State", box "Portland, OR")
             (2, "Job Name", box "Job 3")
             (2, "City, State", box "Portland, OR")
             (3, "Job Name", box "Job 4")
             (3, "City, State", box "Sacramento, CA")] |> Frame.ofValues

    let infoFrame =
            [(0, "City, State", box "Reno, NV")
             (0, "Lat", box "Reno_NV_Lat")
             (0, "Long", box "Reno_NV_Long")
             (1, "City, State", box "Portland, OR")
             (1, "Lat", box "Portland_OR_Lat")
             (1, "Long", box "Portland_OR_Long")] |> Frame.ofValues

    // See the definition of merge_On below.
    let mergedFrame = primaryFrame
                      |> merge_On infoFrame "City, State" null

Which would result in 'mergedFrame' looking like this:

    > mergedFrame.Format();;
    val it : string =
      "     Job Name City, State    Lat             Long             
    0 -> Job 1    Reno, NV       Reno_NV_Lat     Reno_NV_Long     
    1 -> Job 2    Portland, OR   Portland_OR_Lat Portland_OR_Long 
    2 -> Job 3    Portland, OR   Portland_OR_Lat Portland_OR_Long 
    3 -> Job 4    Sacramento, CA <missing>       <missing>   

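I have come up with a way of doing this (the 'merge_On' function used in the example above), but being new to F# I figured there is a more idiomatic/efficient way of doing it. Below is my function, along with 'removeDuplicateRows', which does what you would expect and is needed by 'merge_On'; if you want to comment on a better way to do that as well, please do.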

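    // Keep a single row for each distinct value found in the given column.
    let removeDuplicateRows column (frame : Frame<'a, 'b>) =
             // After GroupRowsBy, each row key is a (group value, original row key) pair.
             let nonDupKeys = frame.GroupRowsBy(column).RowKeys
                              |> Seq.distinctBy (fun (groupKey, _) -> groupKey) 
                              |> Seq.map (fun (_, rowKey) -> rowKey)  
             frame.Rows.[nonDupKeys]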


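    // Add the columns of infoFrame to primaryFrame, matching rows on the values in mergeOnCol.
    // Rows with no match in infoFrame receive missingReplacement.
    let merge_On (infoFrame : Frame<'c, 'b>) mergeOnCol missingReplacement 
                  (primaryFrame : Frame<'a,'b>) =
          // AddColumn mutates the frame, so work on a copy of the primary frame.
          let frame = primaryFrame.Clone() 
          // Re-index the info frame by the merge column so rows can be looked up by value.
          let infoFrame =  infoFrame                           
                           |> removeDuplicateRows mergeOnCol 
                           |> Frame.indexRows mergeOnCol
          let initialSeries = frame.GetColumn(mergeOnCol)
          let infoFrameRows = infoFrame.RowKeys
          for colKey in infoFrame.ColumnKeys do
              let newSeries =
                  [for v in initialSeries.ValuesAll do
                        if Seq.contains v infoFrameRows then  
                            let row = infoFrame.GetRow(v)
                            yield row.[colKey]
                        else
                            yield box missingReplacement ]
              frame.AddColumn(colKey, newSeries)
          frame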

Thanks for your help!

Updates:

Switched Frame.indexRowsString to Frame.indexRows to handle cases where the values in 'mergeOnCol' are not strings.

Got rid of infoFrame.Clone() as suggested by Tomas.

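1 answer:

Answer 0 (score: 0)

The way Deedle joins frames (only on row/column keys) sadly means that it does not have a nice built-in function for joining frames on a non-key column.

As far as I can see, your approach looks very good to me. You do not need `Clone` on the `infoFrame` (because you are not mutating that frame), and I think you can replace `infoFrame.GetRow` with `infoFrame.TryGetRow` (then you will not need to get the keys in advance), but other than that, your code looks fine!

I came up with an alternative, slightly shorter way of doing this, which looks as follows: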

    // Index the info frame by city/state, so that we can do lookup
    let infoByCity = infoFrame |> Frame.indexRowsString "City, State"

    // Create a new frame with the same row indices as 'primaryFrame' 
    // containing the additional information from infoFrame.
    let infoMatched = 
      primaryFrame.Rows
      |> Series.map (fun k row -> 
          // For every row, we get the "City, State" value of the row and then
          // find the corresponding row with additional information in infoFrame. Using 
          // 'ValueOrDefault' will automatically give missing when the key does not exist
          infoByCity.Rows.TryGet(row.GetAs<string>("City, State")).ValueOrDefault)
      // Now turn the series of rows into a frame
      |> Frame.ofRows

    // Now we have two frames with matching keys, so we can join!
    primaryFrame.Join(infoMatched)

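This is a bit shorter and maybe more self-explanatory, but I have not done any tests to check which is faster. Unless performance is a primary concern, I think going with the more readable version is a good default choice!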
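For readers who want to see the underlying idea without Deedle, here is a minimal sketch in plain F#. Both versions above index the info table by the merge key and then try a lookup for every primary row; a failed lookup becomes a missing value. The record types `Job` and `Location` and the function `leftJoinOnCity` are hypothetical names made up for this illustration. None of this is Deedle API, and `Map.tryFind` only plays a role loosely analogous to a "try" lookup on a frame.

```fsharp
// Plain F# sketch (no Deedle): a left join on a non-key column via lookup.
// Job, Location and leftJoinOnCity are hypothetical names for illustration.

type Job = { Name : string; CityState : string }
type Location = { Lat : string; Long : string }

// Index the info table by the merge key so each lookup is a single Map query.
let locationsByCity : Map<string, Location> =
    [ "Reno, NV",     { Lat = "Reno_NV_Lat";     Long = "Reno_NV_Long" }
      "Portland, OR", { Lat = "Portland_OR_Lat"; Long = "Portland_OR_Long" } ]
    |> Map.ofList

// For each primary row, try the lookup; None stands in for <missing>.
let leftJoinOnCity (lookup : Map<string, Location>) (jobs : Job list) =
    jobs |> List.map (fun job -> job, Map.tryFind job.CityState lookup)

let jobs =
    [ { Name = "Job 1"; CityState = "Reno, NV" }
      { Name = "Job 2"; CityState = "Portland, OR" }
      { Name = "Job 3"; CityState = "Portland, OR" }
      { Name = "Job 4"; CityState = "Sacramento, CA" } ]

let merged = leftJoinOnCity locationsByCity jobs

// Print one line per primary row, marking rows that had no match.
for (job, loc) in merged do
    match loc with
    | Some l -> printfn "%s  %s  %s  %s" job.Name job.CityState l.Lat l.Long
    | None   -> printfn "%s  %s  <missing>  <missing>" job.Name job.CityState
```

Using a single `Map.tryFind` per row avoids the separate "does the key exist?" check followed by a second lookup, which is the same simplification suggested above for `merge_On`.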