dplyr: group by and summarise

Time: 2020-10-06 00:30:49

Tags: r dplyr

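Sample data is below...

I want to summarise how many "Type" points are counted per month (Type is the type of vessel). So, first, I want a summary of how many points each vessel "Type" has in each month. For example, June has 5 Fishing vessel points.

Preferably using dplyr:

I have something like: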

dfsum <- df  %>% group_by(Month, Type) %>% tally()

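This works well, but I also want to do the above by unique vessel ID. A vessel can have multiple points per month, but I want to know how many unique vessels there are per month.

I can add id to the grouping: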

dfsum2 <- df  %>% group_by(Month, id,Type) %>% tally()

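However, this is not very tidy and will be hard to compile with larger datasets. Instead, I would like the result to say that February has 2 unique Fishing vessels (using this sample data). Is there a better way to extract this information?

Desired output: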

Month   Type      n
Jan     Fishing  x
Feb     Fishing  x
Feb     Sailing  x
March   Fishing  x

where x is the number (count) of unique vessels, by id, in that category for that month.

# Dummy data

df<- structure(list(UTC_Time = structure(c(1L, 1L, 1L, 1L, 339L, 339L, 
339L, 68L, 68L, 68L, 154L, 154L, 154L, 154L, 154L, 154L, 14L, 
14L, 14L, 14L, 14L, 15L, 50L, 50L, 51L, 51L, 51L, 51L, 51L, 51L, 
51L, 51L, 51L, 77L, 146L, 147L, 147L, 147L, 147L, 147L, 148L, 
148L), .Label = c("2018-01-01 0:00:00", "2018-01-02 0:00:00", 
"2018-01-03 0:00:00", "2018-01-04 0:00:00", "2018-01-05 0:00:00", 
"2018-01-06 0:00:00", "2018-01-07 0:00:00", "2018-01-08 0:00:00", 
"2018-01-09 0:00:00", "2018-01-10 0:00:00", "2018-01-11 0:00:00", 
"2018-01-12 0:00:00", "2018-01-13 0:00:00", "2018-01-14 0:00:00", 
"2018-01-15 0:00:00", "2018-01-16 0:00:00", "2018-01-17 0:00:00", 
"2018-01-18 0:00:00", "2018-01-19 0:00:00", "2018-01-20 0:00:00", 
"2018-01-21 0:00:00", "2018-01-22 0:00:00", "2018-01-23 0:00:00", 
"2018-01-24 0:00:00", "2018-01-25 0:00:00", "2018-01-26 0:00:00", 
"2018-01-27 0:00:00", "2018-01-28 0:00:00", "2018-01-29 0:00:00", 
"2018-01-30 0:00:00", "2018-01-31 0:00:00", "2018-02-01 0:00:00", 
"2018-02-02 0:00:00", "2018-02-03 0:00:00", "2018-02-04 0:00:00", 
"2018-02-05 0:00:00", "2018-02-06 0:00:00", "2018-02-07 0:00:00", 
"2018-02-08 0:00:00", "2018-02-09 0:00:00", "2018-02-10 0:00:00", 
"2018-02-11 0:00:00", "2018-02-12 0:00:00", "2018-02-13 0:00:00", 
"2018-02-14 0:00:00", "2018-02-15 0:00:00", "2018-02-16 0:00:00", 
"2018-02-17 0:00:00", "2018-02-18 0:00:00", "2018-02-19 0:00:00", 
"2018-02-20 0:00:00", "2018-02-21 0:00:00", "2018-02-22 0:00:00", 
"2018-02-23 0:00:00", "2018-02-24 0:00:00", "2018-02-25 0:00:00", 
"2018-02-26 0:00:00", "2018-02-27 0:00:00", "2018-02-28 0:00:00", 
 "2018-03-01 0:00:00", "2018-03-02 0:00:00", "2018-03-03 0:00:00", 
"2018-03-04 0:00:00", "2018-03-05 0:00:00", "2018-03-06 0:00:00", 
"2018-03-07 0:00:00", "2018-03-08 0:00:00", "2018-03-09 0:00:00", 
"2018-03-10 0:00:00", "2018-03-11 0:00:00", "2018-03-12 0:00:00", 
"2018-03-13 0:00:00", "2018-03-14 0:00:00", "2018-03-15 0:00:00", 
"2018-03-16 0:00:00", "2018-03-17 0:00:00", "2018-03-18 0:00:00", 
"2018-03-19 0:00:00", "2018-03-20 0:00:00", "2018-03-21 0:00:00", 
"2018-03-22 0:00:00", "2018-03-23 0:00:00", "2018-03-24 0:00:00", 
"2018-03-25 0:00:00", "2018-03-26 0:00:00", "2018-03-27 0:00:00", 
"2018-03-28 0:00:00", "2018-03-29 0:00:00", "2018-03-30 0:00:00", 
"2018-03-31 0:00:00", "2018-04-01 0:00:00", "2018-04-02 0:00:00", 
"2018-04-03 0:00:00", "2018-04-04 0:00:00", "2018-04-05 0:00:00", 
"2018-04-06 0:00:00", "2018-04-07 0:00:00", "2018-04-08 0:00:00", 
 "2018-04-09 0:00:00", "2018-04-10 0:00:00", "2018-04-11 0:00:00", 
"2018-04-12 0:00:00", "2018-04-13 0:00:00", "2018-04-14 0:00:00", 
"2018-04-15 0:00:00", "2018-04-16 0:00:00", "2018-04-17 0:00:00", 
"2018-04-18 0:00:00", "2018-04-19 0:00:00", "2018-04-20 0:00:00", 
"2018-04-21 0:00:00", "2018-04-22 0:00:00", "2018-04-23 0:00:00", 
"2018-04-24 0:00:00", "2018-04-25 0:00:00", "2018-04-26 0:00:00", 
 "2018-04-27 0:00:00", "2018-04-28 0:00:00", "2018-04-29 0:00:00", 
"2018-04-30 0:00:00", "2018-05-01 0:00:00", "2018-05-02 0:00:00", 
"2018-05-03 0:00:00", "2018-05-04 0:00:00", "2018-05-05 0:00:00", 
"2018-05-06 0:00:00", "2018-05-07 0:00:00", "2018-05-08 0:00:00", 
"2018-05-09 0:00:00", "2018-05-10 0:00:00", "2018-05-11 0:00:00", 
"2018-05-12 0:00:00", "2018-05-13 0:00:00", "2018-05-14 0:00:00", 
"2018-05-15 0:00:00", "2018-05-16 0:00:00", "2018-05-17 0:00:00", 
"2018-05-18 0:00:00", "2018-05-19 0:00:00", "2018-05-20 0:00:00", 
"2018-05-21 0:00:00", "2018-05-22 0:00:00", "2018-05-23 0:00:00", 
"2018-05-24 0:00:00", "2018-05-25 0:00:00", "2018-05-26 0:00:00", 
"2018-05-27 0:00:00", "2018-05-28 0:00:00", "2018-05-29 0:00:00", 
"2018-05-30 0:00:00", "2018-05-31 0:00:00", "2018-06-01 0:00:00", 
"2018-06-02 0:00:00", "2018-06-03 0:00:00", "2018-06-04 0:00:00", 
"2018-06-05 0:00:00", "2018-06-06 0:00:00", "2018-06-07 0:00:00", 
"2018-06-08 0:00:00", "2018-06-09 0:00:00", "2018-06-10 0:00:00", 
"2018-06-11 0:00:00", "2018-06-12 0:00:00", "2018-06-13 0:00:00", 
"2018-06-14 0:00:00", "2018-06-15 0:00:00", "2018-06-16 0:00:00", 
"2018-06-17 0:00:00", "2018-06-18 0:00:00", "2018-06-19 0:00:00", 
"2018-06-20 0:00:00", "2018-06-21 0:00:00", "2018-06-22 0:00:00", 
"2018-06-23 0:00:00", "2018-06-24 0:00:00", "2018-06-25 0:00:00", 
"2018-06-26 0:00:00", "2018-06-27 0:00:00", "2018-06-28 0:00:00", 
"2018-06-29 0:00:00", "2018-06-30 0:00:00", "2018-07-01 0:00:00", 
"2018-07-02 0:00:00", "2018-07-03 0:00:00", "2018-07-04 0:00:00", 
"2018-07-05 0:00:00", "2018-07-06 0:00:00", "2018-07-07 0:00:00", 
"2018-07-08 0:00:00", "2018-07-09 0:00:00", "2018-07-10 0:00:00", 
"2018-07-11 0:00:00", "2018-07-12 0:00:00", "2018-07-13 0:00:00", 
"2018-07-14 0:00:00", "2018-07-15 0:00:00", "2018-07-16 0:00:00", 
 "2018-07-17 0:00:00", "2018-07-18 0:00:00", "2018-07-19 0:00:00", 
"2018-07-20 0:00:00", "2018-07-21 0:00:00", "2018-07-22 0:00:00", 
 "2018-07-23 0:00:00", "2018-07-24 0:00:00", "2018-07-25 0:00:00", 
"2018-07-26 0:00:00", "2018-07-27 0:00:00", "2018-07-28 0:00:00", 
"2018-07-29 0:00:00", "2018-07-30 0:00:00", "2018-07-31 0:00:00", 
"2018-08-01 0:00:00", "2018-08-02 0:00:00", "2018-08-03 0:00:00", 
 "2018-08-04 0:00:00", "2018-08-05 0:00:00", "2018-08-06 0:00:00", 
 "2018-08-07 0:00:00", "2018-08-08 0:00:00", "2018-08-09 0:00:00", 
"2018-08-10 0:00:00", "2018-08-11 0:00:00", "2018-08-12 0:00:00", 
 "2018-08-13 0:00:00", "2018-08-14 0:00:00", "2018-08-15 0:00:00", 
"2018-08-16 0:00:00", "2018-08-17 0:00:00", "2018-08-18 0:00:00", 
"2018-08-19 0:00:00", "2018-08-20 0:00:00", "2018-08-21 0:00:00", 
"2018-08-22 0:00:00", "2018-08-23 0:00:00", "2018-08-24 0:00:00", 
"2018-08-25 0:00:00", "2018-08-26 0:00:00", "2018-08-27 0:00:00", 
"2018-08-28 0:00:00", "2018-08-29 0:00:00", "2018-08-30 0:00:00", 
"2018-08-31 0:00:00", "2018-09-01 0:00:00", "2018-09-02 0:00:00", 
"2018-09-03 0:00:00", "2018-09-04 0:00:00", "2018-09-05 0:00:00", 
"2018-09-06 0:00:00", "2018-09-07 0:00:00", "2018-09-08 0:00:00", 
"2018-09-09 0:00:00", "2018-09-10 0:00:00", "2018-09-11 0:00:00", 
"2018-09-12 0:00:00", "2018-09-13 0:00:00", "2018-09-14 0:00:00", 
"2018-09-15 0:00:00", "2018-09-16 0:00:00", "2018-09-17 0:00:00", 
"2018-09-18 0:00:00", "2018-09-19 0:00:00", "2018-09-20 0:00:00", 
"2018-09-21 0:00:00", "2018-09-22 0:00:00", "2018-09-23 0:00:00", 
 "2018-09-24 0:00:00", "2018-09-25 0:00:00", "2018-09-26 0:00:00", 
 "2018-09-27 0:00:00", "2018-09-28 0:00:00", "2018-09-29 0:00:00", 
 "2018-09-30 0:00:00", "2018-10-01 0:00:00", "2018-10-02 0:00:00", 
  "2018-10-03 0:00:00", "2018-10-04 0:00:00", "2018-10-05 0:00:00", 
  "2018-10-06 0:00:00", "2018-10-07 0:00:00", "2018-10-08 0:00:00", 
  "2018-10-09 0:00:00", "2018-10-10 0:00:00", "2018-10-11 0:00:00", 
  "2018-10-12 0:00:00", "2018-10-13 0:00:00", "2018-10-14 0:00:00", 
  "2018-10-15 0:00:00", "2018-10-16 0:00:00", "2018-10-17 0:00:00", 
 "2018-10-18 0:00:00", "2018-10-19 0:00:00", "2018-10-20 0:00:00", 
  "2018-10-21 0:00:00", "2018-10-22 0:00:00", "2018-10-23 0:00:00", 
 "2018-10-24 0:00:00", "2018-10-25 0:00:00", "2018-10-26 0:00:00", 
 "2018-10-27 0:00:00", "2018-10-28 0:00:00", "2018-10-29 0:00:00", 
"2018-10-30 0:00:00", "2018-10-31 0:00:00", "2018-11-01 0:00:00", 
"2018-11-02 0:00:00", "2018-11-03 0:00:00", "2018-11-04 0:00:00", 
"2018-11-05 0:00:00", "2018-11-06 0:00:00", "2018-11-07 0:00:00", 
"2018-11-08 0:00:00", "2018-11-09 0:00:00", "2018-11-10 0:00:00", 
"2018-11-11 0:00:00", "2018-11-12 0:00:00", "2018-11-13 0:00:00", 
"2018-11-14 0:00:00", "2018-11-15 0:00:00", "2018-11-16 0:00:00", 
"2018-11-17 0:00:00", "2018-11-18 0:00:00", "2018-11-19 0:00:00", 
"2018-11-20 0:00:00", "2018-11-21 0:00:00", "2018-11-22 0:00:00", 
"2018-11-23 0:00:00", "2018-11-24 0:00:00", "2018-11-25 0:00:00", 
"2018-11-26 0:00:00", "2018-11-27 0:00:00", "2018-11-28 0:00:00", 
"2018-11-29 0:00:00", "2018-11-30 0:00:00", "2018-12-01 0:00:00", 
"2018-12-02 0:00:00", "2018-12-03 0:00:00", "2018-12-04 0:00:00", 
"2018-12-05 0:00:00", "2018-12-06 0:00:00", "2018-12-07 0:00:00", 
"2018-12-08 0:00:00", "2018-12-09 0:00:00", "2018-12-10 0:00:00", 
"2018-12-11 0:00:00", "2018-12-12 0:00:00", "2018-12-13 0:00:00", 
"2018-12-14 0:00:00", "2018-12-15 0:00:00", "2018-12-16 0:00:00", 
"2018-12-17 0:00:00", "2018-12-18 0:00:00", "2018-12-19 0:00:00", 
"2018-12-20 0:00:00", "2018-12-21 0:00:00", "2018-12-22 0:00:00", 
"2018-12-23 0:00:00", "2018-12-24 0:00:00", "2018-12-25 0:00:00", 
 "2018-12-26 0:00:00", "2018-12-27 0:00:00", "2018-12-28 0:00:00", 
"2018-12-29 0:00:00", "2018-12-30 0:00:00", "2018-12-31 0:00:00", 
"2019-01-01 0:00:00"), class = "factor"), Type = structure(c(4L, 
4L, 4L, 4L, 4L, 4L, 4L, 17L, 17L, 17L, 4L, 12L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 17L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L), .Label = c("Cargo ship", 
 "Cargo ship:DG,HS,MP(OS)", "Cargo ship:DG,HS,MP(X)", "Fishing", 
   "Law enforcement", "Local ship", "Passenger ship", "Passenger ship:DG,HS,MP(OS)", 
 "Passenger ship:DG,HS,MP(Y)", "Pilot", "Pleasure Craft", "Sailing", 
 "Search/rescue", "Ship", "Towing", "Towing(200/25)", "Tug"), class = "factor"), 
Month = structure(c(5L, 5L, 5L, 5L, 3L, 3L, 3L, 8L, 8L, 8L, 
7L, 7L, 7L, 7L, 7L, 7L, 5L, 5L, 5L, 5L, 5L, 5L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 
9L, 9L), .Label = c("Apr", "Aug", "Dec", "Feb", "Jan", "Jul", 
"Jun", "Mar", "May", "Nov", "Oct", "Sep"), class = "factor"), 
id = c(27L, 27L, 27L, 27L, 21L, 21L, 21L, 24L, 24L, 24L, 
20L, 6L, 20L, 20L, 20L, 20L, 48L, 48L, 48L, 48L, 48L, 42L, 
34L, 34L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 23L, 
17L, 17L, 17L, 14L, 14L, 3L, 14L, 3L)), row.names = c(1L, 
2L, 3L, 4L, 650L, 651L, 652L, 262L, 263L, 264L, 400L, 401L, 402L, 
403L, 404L, 405L, 100L, 101L, 102L, 103L, 104L, 105L, 250L, 251L, 
252L, 253L, 254L, 255L, 256L, 257L, 258L, 259L, 260L, 300L, 301L, 
302L, 303L, 304L, 305L, 306L, 307L, 308L), class = "data.frame")

2 answers:

Answer 0 (score: 2)

A base R approach can be used here (it can sometimes be fast):

#Code
result <- aggregate(Type~Month,df,function(x) length(unique(x)))

Output:

  Month Type
1   Dec    1
2   Feb    1
3   Jan    1
4   Jun    2
5   Mar    1
6   May    1

Or perhaps:

#Code 2
result2 <- aggregate(id~Month,df,function(x) length(unique(x)))

Output:

  Month id
1   Dec  1
2   Feb  2
3   Jan  3
4   Jun  2
5   Mar  2
6   May  3

Based on the expected output, you could try this:

#Code
new <- aggregate(id~Month+Type,data=df,function(x) length(unique(x)))

Output:

  Month           Type id
1   Dec        Fishing  1
2   Feb        Fishing  2
3   Jan        Fishing  3
4   Jun        Fishing  1
5   May Passenger ship  3
6   Jun        Sailing  1
7   Mar            Tug  2

Or using dplyr:

library(dplyr)            
#Code
new <- df %>% group_by(Month,Type) %>% summarise(N=length(unique(id)))

Output:

# A tibble: 7 x 3
# Groups:   Month [6]
  Month Type               N
  <fct> <fct>          <int>
1 Dec   Fishing            1
2 Feb   Fishing            2
3 Jan   Fishing            3
4 Jun   Fishing            1
5 Jun   Sailing            1
6 Mar   Tug                2
7 May   Passenger ship     3
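The outputs above come back in alphabetical month order (Dec, Feb, Jan, ...) because Month is a factor with alphabetical levels, while the desired output in the question is in calendar order. A minimal sketch of one way to get calendar order, assuming the month labels match R's built-in month.abb; the toy data frame and the name ordered_counts are made up for illustration and are not the question's df:

```r
library(dplyr)

# Hypothetical toy data (not the question's df): several points per vessel id
toy <- data.frame(
  Month = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"),
  Type  = c("Fishing", "Fishing", "Fishing", "Fishing", "Fishing", "Fishing", "Sailing"),
  id    = c(27L, 27L, 48L, 34L, 34L, 31L, 6L)
)

ordered_counts <- toy %>%
  # Re-level Month so groups sort Jan, Feb, ... instead of alphabetically
  mutate(Month = factor(Month, levels = month.abb)) %>%
  group_by(Month, Type) %>%
  # Same counting idea as above; .groups = "drop" returns an ungrouped tibble
  summarise(N = length(unique(id)), .groups = "drop")

print(ordered_counts)
```

The .groups argument assumes dplyr 1.0.0 or later; on older versions, a trailing ungroup() should have the same effect.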

Answer 1 (score: 1)

We can use n_distinct to find the number of unique "Type" values by "Month"

library(dplyr)
df %>% 
      group_by(Month) %>% 
      summarise(n = n_distinct(Type))

-output

# A tibble: 6 x 2
#  Month     n
#  <fct> <int>
#1 Dec       1
#2 Feb       1
#3 Jan       1
#4 Jun       2
#5 Mar       1
#6 May       1

If it is based on "id"

df %>%
    group_by(Month) %>%
    summarise(n = n_distinct(id))

-output

# A tibble: 6 x 2
#  Month     n
#  <fct> <int>
#1 Dec       1
#2 Feb       2
#3 Jan       3
#4 Jun       2
#5 Mar       2
#6 May       3
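The two snippets above count per Month only. To get the layout in the question's desired output (one row per Month and Type), the same n_distinct idea could be grouped on both columns. A minimal sketch on a made-up toy data frame (toy and unique_vessels are illustrative names, not part of the question):

```r
library(dplyr)

# Hypothetical toy data (not the question's df): several points per vessel id
toy <- data.frame(
  Month = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"),
  Type  = c("Fishing", "Fishing", "Fishing", "Fishing", "Fishing", "Fishing", "Sailing"),
  id    = c(27L, 27L, 48L, 34L, 34L, 31L, 6L)
)

unique_vessels <- toy %>%
  group_by(Month, Type) %>%
  # n_distinct(id) counts each vessel once per Month/Type group
  summarise(n = n_distinct(id), .groups = "drop")

print(unique_vessels)
```

Here Feb/Fishing has three points but only two distinct ids, so its n is 2.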

Or another option is to get the distinct rows and use count

 df %>% 
      distinct(Month, Type) %>%
      count(Month)
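The distinct-then-count pattern should extend to unique vessels per Month and Type by keeping id in the distinct step. A sketch under the assumption that one row per (Month, Type, id) is what "unique vessel" means here; toy and vessel_counts are hypothetical names:

```r
library(dplyr)

# Hypothetical toy data (not the question's df): several points per vessel id
toy <- data.frame(
  Month = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"),
  Type  = c("Fishing", "Fishing", "Fishing", "Fishing", "Fishing", "Fishing", "Sailing"),
  id    = c(27L, 27L, 48L, 34L, 34L, 31L, 6L)
)

vessel_counts <- toy %>%
  distinct(Month, Type, id) %>%  # one row per vessel per Month/Type
  count(Month, Type)             # rows per group = unique vessels, in column n

print(vessel_counts)
```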

Or with data.table

library(data.table)
setDT(df)[, .(n = uniqueN(Type)), Month]
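Note that setDT() converts df to a data.table by reference. If the original data frame should stay untouched, as.data.table() makes a copy instead. A sketch that also groups by both Month and Type and counts unique ids, on made-up toy data (toy, dt, and unique_by_group are illustrative names):

```r
library(data.table)

# Hypothetical toy data (not the question's df): several points per vessel id
toy <- data.frame(
  Month = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"),
  Type  = c("Fishing", "Fishing", "Fishing", "Fishing", "Fishing", "Fishing", "Sailing"),
  id    = c(27L, 27L, 48L, 34L, 34L, 31L, 6L)
)

dt <- as.data.table(toy)  # copy; toy itself remains a plain data.frame

# uniqueN(id) counts distinct vessel ids within each Month/Type group
unique_by_group <- dt[, .(n = uniqueN(id)), by = .(Month, Type)]

print(unique_by_group)
```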

Or with base R

aggregate(Type ~ Month, unique(df[c('Type', 'Month')]), length)
aggregate(id ~ Month, unique(df[c('id', 'Month')]), length)

Regarding the efficiency of base R, and of aggregate in particular: it can be slow, as shown here