Cleaning up a data.frame in a semi-reshape/semi-aggregate fashion

Time: 2018-11-02 20:07:49

Tags: r dplyr tidyr

This is my first time posting here, so please forgive any problems with my question.

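In the example below, I have a data.frame in which each trip is identified by a tripID, along with the name of the vessel, a species code (SPP), and a catch metric (kept).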

> testFrame1 <- data.frame('tripID' = c(1,1,2,2,3,4,5), 
                           'name' = c('SS Anne','SS Anne', 'HMS Endurance', 'HMS Endurance','Salty Hippo', 'Seagallop', 'Borealis'), 
                           'SPP' = c(101,201,101,201,102,102,103), 
                           'kept' = c(12, 22, 14, 24, 16, 18, 10))
> testFrame1
    tripID          name SPP kept
  1      1       SS Anne 101   12
  2      1       SS Anne 201   22
  3      2 HMS Endurance 101   14
  4      2 HMS Endurance 201   24
  5      3   Salty Hippo 102   16
  6      4     Seagallop 102   18
  7      5      Borealis 103   10

I need a way to basically condense the data.frame so that there is only one row per tripID, as shown below.

> testFrame1
    tripID          name SPP kept SPP.1 kept.1
  1      1       SS Anne 101   12   201     22
  2      2 HMS Endurance 101   14   201     24
  3      3   Salty Hippo 102   16    NA     NA
  4      4     Seagallop 102   18    NA     NA
  5      5      Borealis 103   10    NA     NA

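I have looked into tidyr and reshape, but neither seems to do quite what I am asking for. Is there anything out there that does this kind of quasi-reshaping?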

2 answers:

Answer 0: (score: 1)

Here are two options, one using base::reshape and one using data.table::dcast:

1) Base R

# timevar numbers the rows within each tripID (1, 2, ...)
reshape(transform(testFrame1,
                  timevar = ave(tripID, tripID, FUN = seq_along)),
        idvar = c("tripID", "name"),
        timevar = "timevar",
        direction = "wide")
#  tripID          name SPP.1 kept.1 SPP.2 kept.2
#1      1       SS Anne   101     12   201     22
#3      2 HMS Endurance   101     14   201     24
#5      3   Salty Hippo   102     16    NA     NA
#6      4     Seagallop   102     18    NA     NA
#7      5      Borealis   103     10    NA     NA
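
The key step in the base R option is the helper column built with ave(), which numbers the rows within each trip so that reshape() knows which species goes into the .1 columns and which into the .2 columns. A minimal sketch of just that step, rebuilding the question's data so it runs on its own (idx is only an illustrative name):

```r
# Rebuild the example data from the question
testFrame1 <- data.frame(
  tripID = c(1, 1, 2, 2, 3, 4, 5),
  name   = c("SS Anne", "SS Anne", "HMS Endurance", "HMS Endurance",
             "Salty Hippo", "Seagallop", "Borealis"),
  SPP    = c(101, 201, 101, 201, 102, 102, 103),
  kept   = c(12, 22, 14, 24, 16, 18, 10)
)

# Apply seq_along() separately to each group of rows sharing a tripID,
# giving a running counter that restarts at 1 for every trip
idx <- ave(testFrame1$tripID, testFrame1$tripID, FUN = seq_along)
idx
# [1] 1 2 1 2 1 1 1
```

Trips with two species get the counters 1 and 2, and single-species trips get only 1, which is why their .2 columns come out as NA.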

2)data.table

library(data.table)
setDT(testFrame1)
dcast(testFrame1, tripID + name ~ rowid(tripID), value.var = c("SPP", "kept"))
#   tripID          name SPP_1 SPP_2 kept_1 kept_2
#1:      1       SS Anne   101   201     12     22
#2:      2 HMS Endurance   101   201     14     24
#3:      3   Salty Hippo   102    NA     16     NA
#4:      4     Seagallop   102    NA     18     NA
#5:      5      Borealis   103    NA     10     NA

Answer 1: (score: 0)

Great reproducible post, especially for a first one. Here is one way to do it using dplyr and tidyr:

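library(dplyr)
library(tidyr)

testFrame1 %>%
  group_by(tripID, name) %>%
  # collapse each trip's values into one comma-separated string
  summarise(
    SPP = toString(SPP),
    kept = toString(kept)
  ) %>%
  ungroup() %>%
  # split the strings back out into one column per value
  separate("SPP", into = c("SPP", "SPP.1"), sep = ", ", extra = "drop", fill = "right") %>%
  separate("kept", into = c("kept", "kept.1"), sep = ", ", extra = "drop", fill = "right")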

# A tibble: 5 x 6
  tripID name          SPP   SPP.1 kept  kept.1
   <dbl> <chr>         <chr> <chr> <chr> <chr> 
1   1.00 SS Anne       101   201   12    22    
2   2.00 HMS Endurance 101   201   14    24    
3   3.00 Salty Hippo   102   <NA>  16    <NA>  
4   4.00 Seagallop     102   <NA>  18    <NA>  
5   5.00 Borealis      103   <NA>  10    <NA>
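
Note that in the output above the SPP and kept columns all come back as <chr>, because toString() turns the values into strings. If numeric columns are needed, one possible tweak (a sketch, not part of the original answer) is to pass convert = TRUE to separate(), which runs type.convert() on the new columns. The name res is only illustrative, and extra = "drop" still silently discards a third or later species for a trip:

```r
library(dplyr)
library(tidyr)

# Rebuild the example data from the question
testFrame1 <- data.frame(
  tripID = c(1, 1, 2, 2, 3, 4, 5),
  name   = c("SS Anne", "SS Anne", "HMS Endurance", "HMS Endurance",
             "Salty Hippo", "Seagallop", "Borealis"),
  SPP    = c(101, 201, 101, 201, 102, 102, 103),
  kept   = c(12, 22, 14, 24, 16, 18, 10)
)

res <- testFrame1 %>%
  group_by(tripID, name) %>%
  summarise(SPP = toString(SPP), kept = toString(kept)) %>%
  ungroup() %>%
  # convert = TRUE converts the split-out pieces back to numbers
  separate("SPP", into = c("SPP", "SPP.1"), sep = ", ",
           extra = "drop", fill = "right", convert = TRUE) %>%
  separate("kept", into = c("kept", "kept.1"), sep = ", ",
           extra = "drop", fill = "right", convert = TRUE)

# Check the resulting column types
sapply(res, class)
```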