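R: How to aggregate some columns while keeping other columns

Time: 2017-11-21 16:35:23

Tags: r

My problem is similar to the one here, but none of the solutions I have tried worked.

Given a table like this: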

Date    Exercise    Category    Weight  Reps    EstMax  RepxWeight  Note
4/2/16  Deadlift    Legs    135 7   166.4685    7x135   easy
4/2/16  Deadlift    Legs    135 7   166.4685    7x135   kinda easy
4/2/16  Deadlift    Legs    135 7   166.4685    7x135   tired
4/2/16  Bench Press Chest   95  5   110.8175    5x95    hard
4/2/16  Bench Press Chest   135 2   143.991 2x135   not hard
4/9/16  Bench Press Chest   135 2   143.991 2x135   a little hard
4/9/16  Bench Press Chest   135 2   143.991 2x135   super tired
4/18/16 Deadlift    Legs    155 8   196.292 8x155   …
4/18/16 Deadlift    Legs    155 5   180.8075    5x155   bad day
5/8/16  Deadlift    Legs    185 3   203.4815    3x185   good day
5/8/16  Deadlift    Legs    185 3   203.4815    3x185   felt easy
5/8/16  Bench Press Chest   115 4   130.318 4x115   easy
5/8/16  Bench Press Chest   115 4   130.318 4x115   hard

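I would like to aggregate to get the max value of one column (e.g. EstMax) based on several other columns (e.g. Date and Exercise), while also keeping all the other columns in the row. If several entries share the same maximum, take the first one.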

The expected output is as follows:

Date    Exercise    Category    Weight  Reps    EstMax  RepxWeight  Note
4/2/16  Deadlift    Legs    135 7   166.4685    7x135   easy
4/2/16  Bench Press Chest   135 2   143.991 2x135   not hard
4/9/16  Bench Press Chest   135 2   143.991 2x135   a little hard
4/18/16 Deadlift    Legs    155 8   196.292 8x155   …
5/8/16  Deadlift    Legs    185 3   203.4815    3x185   good day
5/8/16  Bench Press Chest   115 4   130.318 4x115   hard

Here are some examples of what I have tried; in each case the 'extra columns' end up being used as grouping factors for the aggregation, which is not what I want.

data <- structure(list(
  Date = structure(c(2L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 4L, 4L, 4L, 4L),
    .Label = c("4/18/16", "4/2/16", "4/9/16", "5/8/16"), class = "factor"),
  Exercise = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L),
    .Label = c("Bench Press", "Deadlift"), class = "factor"),
  Category = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L),
    .Label = c("Chest", "Legs"), class = "factor"),
  Weight = c(135L, 135L, 135L, 95L, 135L, 135L, 135L, 155L, 155L, 185L, 185L, 115L, 115L),
  Reps = c(7L, 7L, 7L, 5L, 2L, 2L, 2L, 8L, 5L, 3L, 3L, 4L, 4L),
  EstMax = c(166.4685, 166.4685, 166.4685, 110.8175, 143.991, 143.991, 143.991,
    196.292, 180.8075, 203.4815, 203.4815, 130.318, 130.318),
  RepxWeight = structure(c(6L, 6L, 6L, 5L, 1L, 1L, 1L, 7L, 4L, 2L, 2L, 3L, 3L),
    .Label = c("2x135", "3x185", "4x115", "5x155", "5x95", "7x135", "8x155"), class = "factor"),
  Note = structure(c(4L, 8L, 11L, 7L, 9L, 2L, 10L, 1L, 3L, 6L, 5L, 4L, 7L),
    .Label = c("…", "a little hard", "bad day", "easy", "felt easy", "good day", "hard",
      "kinda easy", "not hard", "super tired", "tired"), class = "factor")),
  .Names = c("Date", "Exercise", "Category", "Weight", "Reps", "EstMax", "RepxWeight", "Note"),
  class = "data.frame", row.names = c(NA, -13L))

# base R
aggregate(EstMax ~ Date + Exercise, data = data, FUN = max)
#      Date    Exercise   EstMax
# 1  4/2/16 Bench Press 143.9910
# 2  4/9/16 Bench Press 143.9910
# 3  5/8/16 Bench Press 130.3180
# 4 4/18/16    Deadlift 196.2920
# 5  4/2/16    Deadlift 166.4685
# 6  5/8/16    Deadlift 203.4815

aggregate(EstMax ~ Date + Exercise + RepxWeight + Note, data = data, FUN = max)
#       Date    Exercise RepxWeight          Note   EstMax
# 1  4/18/16    Deadlift      8x155             … 196.2920
# 2   4/9/16 Bench Press      2x135 a little hard 143.9910
# 3  4/18/16    Deadlift      5x155       bad day 180.8075
# 4   5/8/16 Bench Press      4x115          easy 130.3180
# 5   4/2/16    Deadlift      7x135          easy 166.4685
# 6   5/8/16    Deadlift      3x185     felt easy 203.4815
# 7   5/8/16    Deadlift      3x185      good day 203.4815
# 8   5/8/16 Bench Press      4x115          hard 130.3180
# 9   4/2/16 Bench Press       5x95          hard 110.8175
# 10  4/2/16    Deadlift      7x135    kinda easy 166.4685
# 11  4/2/16 Bench Press      2x135      not hard 143.9910
# 12  4/9/16 Bench Press      2x135   super tired 143.9910
# 13  4/2/16    Deadlift      7x135         tired 166.4685

# data table
library("data.table")
data_dt <- data.table(data)

data_dt[ , max(EstMax), by = c("Date", "Exercise")]
#       Date    Exercise       V1
# 1:  4/2/16    Deadlift 166.4685
# 2:  4/2/16 Bench Press 143.9910
# 3:  4/9/16 Bench Press 143.9910
# 4: 4/18/16    Deadlift 196.2920
# 5:  5/8/16    Deadlift 203.4815
# 6:  5/8/16 Bench Press 130.3180

data_dt[, max(EstMax), .(Date, Exercise, Weight, Reps, RepxWeight, Note)]
#        Date    Exercise Weight Reps RepxWeight          Note       V1
#  1:  4/2/16    Deadlift    135    7      7x135          easy 166.4685
#  2:  4/2/16    Deadlift    135    7      7x135    kinda easy 166.4685
#  3:  4/2/16    Deadlift    135    7      7x135         tired 166.4685
#  4:  4/2/16 Bench Press     95    5       5x95          hard 110.8175
#  5:  4/2/16 Bench Press    135    2      2x135      not hard 143.9910
#  6:  4/9/16 Bench Press    135    2      2x135 a little hard 143.9910
#  7:  4/9/16 Bench Press    135    2      2x135   super tired 143.9910
#  8: 4/18/16    Deadlift    155    8      8x155             … 196.2920
#  9: 4/18/16    Deadlift    155    5      5x155       bad day 180.8075
# 10:  5/8/16    Deadlift    185    3      3x185      good day 203.4815
# 11:  5/8/16    Deadlift    185    3      3x185     felt easy 203.4815
# 12:  5/8/16 Bench Press    115    4      4x115          easy 130.3180
# 13:  5/8/16 Bench Press    115    4      4x115          hard 130.3180

A base R solution is especially preferred. I also saw the which.max function, which might be helpful, but I could not figure out how to apply it here.

Other related questions I have looked at that do not solve this:

Adding a non-aggregated column to an aggregated data set based on the aggregation of another column

Only keep min value for each factor level

How to select the row with the maximum value in each group

aggregating multiple columns in data.table

How to aggregate some columns while keeping other columns in R?

4 answers:

Answer 0: (score: 8)

I know you asked for a base R solution, but in the meantime, here is one with dplyr:

library(dplyr)

data %>% 
  group_by(Date, Exercise) %>% 
  slice(which.max(EstMax))

# # A tibble: 6 x 8
# # Groups:   Date, Exercise [6]
#      Date    Exercise Category Weight  Reps   EstMax RepxWeight          Note
#    <fctr>      <fctr>   <fctr>  <int> <int>    <dbl>     <fctr>        <fctr>
# 1 4/18/16    Deadlift     Legs    155     8 196.2920      8x155             …
# 2  4/2/16 Bench Press    Chest    135     2 143.9910      2x135      not hard
# 3  4/2/16    Deadlift     Legs    135     7 166.4685      7x135          easy
# 4  4/9/16 Bench Press    Chest    135     2 143.9910      2x135 a little hard
# 5  5/8/16 Bench Press    Chest    115     4 130.3180      4x115          easy
# 6  5/8/16    Deadlift     Legs    185     3 203.4815      3x185      good day

Edit

data.table is not my forte, but for completeness, here is my attempt:

library(data.table)

setDT(data)[, .SD[which.max(EstMax)], by = .(Date, Exercise)]

#       Date    Exercise Category Weight Reps   EstMax RepxWeight          Note
# 1:  4/2/16    Deadlift     Legs    135    7 166.4685      7x135          easy
# 2:  4/2/16 Bench Press    Chest    135    2 143.9910      2x135      not hard
# 3:  4/9/16 Bench Press    Chest    135    2 143.9910      2x135 a little hard
# 4: 4/18/16    Deadlift     Legs    155    8 196.2920      8x155             …
# 5:  5/8/16    Deadlift     Legs    185    3 203.4815      3x185      good day
# 6:  5/8/16 Bench Press    Chest    115    4 130.3180      4x115          easy
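
Since the question prefers base R, the same "keep the first max row per group" idea can be sketched without extra packages using ave() and which.max(). This is only a sketch: workouts is a hand-typed, trimmed copy of the question's table (an assumption made so the example is self-contained), with plain character columns instead of factors.

```r
# Trimmed, hand-typed copy of the question's table (assumption: only the
# columns needed for the illustration are kept).
workouts <- data.frame(
  Date     = c("4/2/16", "4/2/16", "4/2/16", "4/2/16", "4/2/16", "4/9/16", "4/9/16",
               "4/18/16", "4/18/16", "5/8/16", "5/8/16", "5/8/16", "5/8/16"),
  Exercise = c("Deadlift", "Deadlift", "Deadlift", "Bench Press", "Bench Press",
               "Bench Press", "Bench Press", "Deadlift", "Deadlift", "Deadlift",
               "Deadlift", "Bench Press", "Bench Press"),
  EstMax   = c(166.4685, 166.4685, 166.4685, 110.8175, 143.991, 143.991, 143.991,
               196.292, 180.8075, 203.4815, 203.4815, 130.318, 130.318),
  Note     = c("easy", "kinda easy", "tired", "hard", "not hard", "a little hard",
               "super tired", "\u2026", "bad day", "good day", "felt easy", "easy", "hard"),
  stringsAsFactors = FALSE
)

# Within each Date/Exercise group, flag the position returned by which.max(),
# i.e. the first row holding the group's maximum EstMax (1 = keep, 0 = drop).
keep <- ave(workouts$EstMax, workouts$Date, workouts$Exercise,
            FUN = function(v) seq_along(v) == which.max(v))

# Subsetting whole rows keeps every other column untouched.
result <- workouts[keep == 1, ]
print(result)
```

Because the rows are subset rather than aggregated, the non-grouping columns never take part in the grouping, which is the behaviour the question is after.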

Answer 1: (score: 2)

Here is another approach with dplyr:

library(dplyr)
library(lubridate)

data %>%
  mutate(Date = mdy(Date)) %>%
  group_by(Date, Exercise) %>%
  arrange(desc(EstMax)) %>%
  slice(1)

Result:

# A tibble: 6 x 8
# Groups:   Date, Exercise [6]
        Date    Exercise Category Weight  Reps   EstMax RepxWeight          Note
      <date>      <fctr>   <fctr>  <int> <int>    <dbl>     <fctr>        <fctr>
1 2016-04-02 Bench Press    Chest    135     2 143.9910      2x135      not hard
2 2016-04-02    Deadlift     Legs    135     7 166.4685      7x135          easy
3 2016-04-09 Bench Press    Chest    135     2 143.9910      2x135 a little hard
4 2016-04-18    Deadlift     Legs    155     8 196.2920      8x155             …
5 2016-05-08 Bench Press    Chest    115     4 130.3180      4x115          easy
6 2016-05-08    Deadlift     Legs    185     3 203.4815      3x185      good day

Or you can also use sqldf:

library(sqldf)
library(lubridate)

data$Date = mdy(data$Date)

sqldf("select *, max(EstMax) as EstMax2 from data
        group by Date, Exercise
        order by Date, Exercise")

Result:

        Date    Exercise Category Weight Reps   EstMax RepxWeight          Note  EstMax2
1 2016-04-02 Bench Press    Chest    135    2 143.9910      2x135      not hard 143.9910
2 2016-04-02    Deadlift     Legs    135    7 166.4685      7x135          easy 166.4685
3 2016-04-09 Bench Press    Chest    135    2 143.9910      2x135 a little hard 143.9910
4 2016-04-18    Deadlift     Legs    155    8 196.2920      8x155             … 196.2920
5 2016-05-08 Bench Press    Chest    115    4 130.3180      4x115          easy 130.3180
6 2016-05-08    Deadlift     Legs    185    3 203.4815      3x185      good day 203.4815

Answer 2: (score: 2)

An (incorrect) approach that aggregates all the numeric columns independently, shown here to illustrate a problem:

grpvar <- c("Date", "Exercise", "Category")
merge(
  aggregate(data[,c("Weight", "Reps", "EstMax")], by = data[grpvar], FUN = max),
  aggregate(data[,c("RepxWeight", "Note")], by = data[grpvar], FUN = function(a) a[1]),
  by = grpvar
)
#      Date    Exercise Category Weight Reps   EstMax RepxWeight          Note
# 1 4/18/16    Deadlift     Legs    155    8 196.2920      8x155           ...
# 2  4/2/16 Bench Press    Chest    135    5 143.9910       5x95          hard
# 3  4/2/16    Deadlift     Legs    135    7 166.4685      7x135          easy
# 4  4/9/16 Bench Press    Chest    135    2 143.9910      2x135 a little hard
# 5  5/8/16 Bench Press    Chest    115    4 130.3180      4x115          easy
# 6  5/8/16    Deadlift     Legs    185    3 203.4815      3x185      good day

On 4/2/16, your bench press shows a max weight of 135 and a max reps of 5, but the two never occur in the same row of the data.
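
A quick way to see the mismatch is to compute the two column maxima for that group and check whether any single row carries both. The sketch below uses bench, a hand-typed copy of the two 4/2/16 Bench Press rows from the question (an assumed helper object, not part of the original code).

```r
# Hand-typed copy of the two Bench Press rows recorded on 4/2/16 (assumption).
bench <- data.frame(
  Weight = c(95, 135),
  Reps   = c(5, 2),
  EstMax = c(110.8175, 143.991),
  Note   = c("hard", "not hard"),
  stringsAsFactors = FALSE
)

max_weight <- max(bench$Weight)  # 135, from the second row
max_reps   <- max(bench$Reps)    # 5, from the first row

# Does any row hold both maxima at once? Here it does not.
same_row <- any(bench$Weight == max_weight & bench$Reps == max_reps)
print(same_row)  # FALSE
```

Aggregating each column on its own therefore builds a "row" that was never actually recorded.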

Here is a slightly different (and more correct) approach, using your idea of which.max:

do.call(rbind,
        by(data, data[c("Date", "Exercise")],
           function(x) x[which.max(x$Weight),])
        )
#       Date    Exercise Category Weight Reps   EstMax RepxWeight          Note
# 5   4/2/16 Bench Press    Chest    135    2 143.9910      2x135      not hard
# 6   4/9/16 Bench Press    Chest    135    2 143.9910      2x135 a little hard
# 12  5/8/16 Bench Press    Chest    115    4 130.3180      4x115          easy
# 8  4/18/16    Deadlift     Legs    155    8 196.2920      8x155           ...
# 1   4/2/16    Deadlift     Legs    135    7 166.4685      7x135          easy
# 10  5/8/16    Deadlift     Legs    185    3 203.4815      3x185      good day

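If, for some reason, a single Exercise could have more than one Category, you may want the second argument of by to be data[c("Date","Exercise","Category")] instead.

(You can order the output with something like x[order(as.Date(x$Date, format="%m/%d/%y")),] ... in fact, you might consider making the $Date column an actual Date class.)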
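
The question asks for the maximum EstMax, whereas the by() call above selects on Weight; on this data both pick the same rows. As a hedged alternative keyed on EstMax, here is a sketch of the order-then-drop-duplicates idiom with Date stored as a real Date class. workouts is again a hand-typed, trimmed copy of the question's table (an assumption for self-containment).

```r
# Trimmed, hand-typed copy of the question's table (assumption).
workouts <- data.frame(
  Date     = c("4/2/16", "4/2/16", "4/2/16", "4/2/16", "4/2/16", "4/9/16", "4/9/16",
               "4/18/16", "4/18/16", "5/8/16", "5/8/16", "5/8/16", "5/8/16"),
  Exercise = c("Deadlift", "Deadlift", "Deadlift", "Bench Press", "Bench Press",
               "Bench Press", "Bench Press", "Deadlift", "Deadlift", "Deadlift",
               "Deadlift", "Bench Press", "Bench Press"),
  EstMax   = c(166.4685, 166.4685, 166.4685, 110.8175, 143.991, 143.991, 143.991,
               196.292, 180.8075, 203.4815, 203.4815, 130.318, 130.318),
  stringsAsFactors = FALSE
)

# Two-digit years need %y; %Y would read "16" as the year 16.
workouts$Date <- as.Date(workouts$Date, format = "%m/%d/%y")

# Sort by group, then by EstMax descending. order() leaves ties in their
# original order, so the first-recorded max row stays first in its group.
ord <- workouts[order(workouts$Date, workouts$Exercise, -workouts$EstMax), ]

# Keep only the first row of each Date/Exercise group.
best <- ord[!duplicated(ord[c("Date", "Exercise")]), ]
print(best)
```

The result comes out already sorted chronologically, so no separate ordering step is needed afterwards.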

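Answer 3: (score: 1)

I know you prefer a base R solution, but dplyr provides a function 'top_n' that does exactly what you are asking for.

Use it once to retrieve all instances of the maximum EstMax: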

library(dplyr)

data %>%
  group_by(Exercise) %>%
  top_n(1, EstMax)

# A tibble: 5 x 8
# Groups:   Exercise [2]
    Date    Exercise Category Weight  Reps   EstMax RepxWeight          Note
  <fctr>      <fctr>   <fctr>  <int> <int>    <dbl>     <fctr>        <fctr>
1 4/2/16 Bench Press    Chest    135     2 143.9910      2x135      not hard
2 4/9/16 Bench Press    Chest    135     2 143.9910      2x135 a little hard
3 4/9/16 Bench Press    Chest    135     2 143.9910      2x135   super tired
4 5/8/16    Deadlift     Legs    185     3 203.4815      3x185      good day
5 5/8/16    Deadlift     Legs    185     3 203.4815      3x185     felt easy

Use it twice to retrieve the first of the max results:

data %>%
  group_by(Exercise) %>%
  top_n(1, EstMax) %>%
  top_n(1)

Selecting by Note
# A tibble: 2 x 8
# Groups:   Exercise [2]
    Date    Exercise Category Weight  Reps   EstMax RepxWeight        Note
  <fctr>      <fctr>   <fctr>  <int> <int>    <dbl>     <fctr>      <fctr>
1 4/9/16 Bench Press    Chest    135     2 143.9910      2x135 super tired
2 5/8/16    Deadlift     Legs    185     3 203.4815      3x185    good day

Note that this is the first result, not necessarily the earliest date. So you have to arrange by date before using the second 'top_n':

data %>%
  group_by(Exercise) %>%
  top_n(1, EstMax) %>%
  mutate(Date = as.Date(Date, format = '%m/%d/%y')) %>%
  arrange(Date) %>%
  top_n(1)

Selecting by Note
# A tibble: 2 x 8
# Groups:   Exercise [2]
        Date    Exercise Category Weight  Reps   EstMax RepxWeight        Note
      <date>      <fctr>   <fctr>  <int> <int>    <dbl>     <fctr>      <fctr>
1 2016-04-09 Bench Press    Chest    135     2 143.9910      2x135 super tired
2 2016-05-08    Deadlift     Legs    185     3 203.4815      3x185    good day

[edit] I slightly misread the question; here is a solution that gives the output you asked for:

data %>%
  group_by(Date, Exercise) %>%
  top_n(1, EstMax) %>%
  top_n(1)

Selecting by Note
# A tibble: 6 x 8
# Groups:   Date, Exercise [6]
     Date    Exercise Category Weight  Reps   EstMax RepxWeight        Note
   <fctr>      <fctr>   <fctr>  <int> <int>    <dbl>     <fctr>      <fctr>
1  4/2/16    Deadlift     Legs    135     7 166.4685      7x135       tired
2  4/2/16 Bench Press    Chest    135     2 143.9910      2x135    not hard
3  4/9/16 Bench Press    Chest    135     2 143.9910      2x135 super tired
4 4/18/16    Deadlift     Legs    155     8 196.2920      8x155           …
5  5/8/16    Deadlift     Legs    185     3 203.4815      3x185    good day
6  5/8/16 Bench Press    Chest    115     4 130.3180      4x115        hard