Reduce/filter data based on category and date of occurrence

Time: 2019-05-09 18:14:32

Tags: r if-statement dplyr data-manipulation posixct

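I have a dataset of different vessels in different zones. The output I get records each vessel's name, its type (e.g. fishing/cargo), the time it entered the zone, the time it left, and how long it stayed in the zone. DOS is just the distance offshore, i.e. the zone I am looking in.

My problem is that fishing vessels often run transects and enter and leave the zone several times in a single day, so they are recorded multiple times in my report output.

I would like to consolidate the fishing vessel data so that each vessel of the same name (Type: Fishing only) is recorded once per day, with all but one record removed. For simplicity, maybe just look at the "First seen inside" date, because I think things get more complicated when a stay spans several days (I will come back to that idea later).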

Dummy data:

 df <- structure(list(Name = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 
 3L, 3L, 3L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L, 7L, 8L, 
 8L, 9L), .Label = c("A", "B", "C", "D", "E", "F", "G", "H", "I"
 ), class = "factor"), Type = structure(c(2L, 2L, 2L, 2L, 2L, 
 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 
 2L, 1L, 1L, 2L), .Label = c("Cargo", "Fishing"), class = "factor"), 
 `First seen inside` = structure(c(1556385360, 1556393640, 
 1556002200, 1556260260, 1556518860, 1556136660, 1556278500, 
 1556285820, 1556391480, 1556509620, 1556319480, 1556214120, 
 1556235600, 1556325540, 1556326920, 1556329500, 1556330220, 
 1556330580, 1556330880, 1556330940, 1556332980, 1556339880, 
 1556340900, 1556344140, 1556344500, 1556345220, 1556346420, 
 1556348220, 1556348520, 1556350860, 1556351460, 1556356620, 
 1556360220, 1556365920, 1556366520, 1556367180, 1556076420, 
 1556166900, 1556154840, 1556454900, 1556291220), class = c("POSIXct", 
 "POSIXt"), tzone = ""), `Last seen inside` = structure(c(34L, 
 35L, 1L, 8L, 38L, 3L, 7L, 9L, 36L, 38L, 27L, 4L, 5L, 10L, 
 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 
 23L, 24L, 25L, 26L, 28L, 29L, 30L, 31L, 32L, 33L, 2L, 6L, 
 37L, 38L, 38L), .Label = c("4/23/2019 14:27", "4/24/2019 21:23", 
 "4/25/2019 00:00", "4/25/2019 10:47", "4/25/2019 16:59", 
 "4/25/2019 23:49", "4/26/2019 05:17", "4/26/2019 13:39", 
 "4/26/2019 15:12", "4/26/2019 17:54", "4/26/2019 18:05", 
 "4/26/2019 18:51", "4/26/2019 19:00", "4/26/2019 19:06", 
 "4/26/2019 19:08", "4/26/2019 19:13", "4/26/2019 21:24", 
 "4/26/2019 21:38", "4/26/2019 22:02", "4/26/2019 22:51", 
 "4/26/2019 22:55", "4/26/2019 23:22", "4/26/2019 23:51", 
 "4/27/2019 00:00", "4/27/2019 00:36", "4/27/2019 00:42", 
 "4/27/2019 01:17", "4/27/2019 02:06", "4/27/2019 03:11", 
 "4/27/2019 04:30", "4/27/2019 05:00", "4/27/2019 05:03", 
 "4/27/2019 05:13", "4/27/2019 10:29", "4/27/2019 12:42", 
 "4/27/2019 17:21", "4/28/2019 03:47", "4/29/2019 09:56"), class = 
  "factor"), 
`Time in zone` = structure(c(5L, 31L, 6L, 7L, 2L, 3L, 23L, 
 30L, 26L, 4L, 32L, 27L, 9L, 8L, 22L, 28L, 22L, 22L, 1L, 24L, 
 15L, 1L, 29L, 18L, 1L, 8L, 17L, 22L, 19L, 16L, 14L, 25L, 
 13L, 31L, 16L, 1L, 12L, 10L, 21L, 11L, 20L), .Label = c("", 
 "10h 35m", "10h 49m", "13h 9m", "13m", "14h 37m", "14h 8m", 
 "15m", "19m", "1d 2h 14m", "1d 4h 21m", "1d 56m", "1h 13m", 
 "1h 15m", "1h 41m", "1m", "24m", "2m", "34m", "3d 1h 49m", 
 "3d 9h 33m", "3m", "42m", "4m", "54m", "5h 23m", "5m", "6m", 
 "7m", "8h 35m", "8m", "9h 19m"), class = "factor"), DOS = 
  structure(c(1L, 
 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "0-12", class = 
 "factor")), row.names = c(NA, 
 -41L), class = "data.frame")

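For example, in my dummy dataset:

  • Vessel "A" is a "Fishing" vessel in DOS 0-12 and occurs twice on April 27th, so I would like to reduce its entries to one record. If possible, it would be great if the total "Time in zone" were summed and the latest "Last seen inside" carried over into the mutated data, but if that is too complicated, don't worry about it. So vessel A would show only: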

     Name      Type   First seen inside    Last seen inside  Time in zone    DOS
        A   Fishing     4/27/2019 12:16     4/27/2019 12:42           21m   0-12
    

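    But I would be happy just to reduce it to one of the rows, without correcting the last seen time and the time in zone, if that is too much.

  • Vessel C is a cargo vessel, so I don't want it treated like the fishing vessels, and I want to keep all of its records even if there are several per day.

  • Vessel E appears on three different dates, so I would expect three entries for it...

I hope this makes sense? I am not sure whether this is a dplyr filter or a mutate option based on multiple occurrences within a day? Any advice on how to manage this "problem" would be great... or maybe I need to do some manual work on the dataset :(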

1 Answer:

Answer 0: (score: 1)

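library(dplyr)

# One row per fishing vessel, DOS and first-seen date, with the latest last-seen date
df %>% group_by(Name, DOS, as.Date(`First seen inside`)) %>% 
  filter(Type == "Fishing") %>% 
  summarize(last = max(as.Date(`Last seen inside`, format = "%m/%d/%Y")))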

Something like this? Result:

# A tibble: 10 x 4
# Groups:   Name, DOS [6]
   Name  DOS   `as.Date(\`First seen inside\`)` last      
   <fct> <fct> <date>                           <date>    
 1 A     0-12  2019-04-27                       2019-04-27
 2 B     0-12  2019-04-23                       2019-04-23
 3 B     0-12  2019-04-26                       2019-04-26
 4 B     0-12  2019-04-29                       2019-04-29
 5 D     0-12  2019-04-26                       2019-04-27
 6 E     0-12  2019-04-25                       2019-04-25
 7 E     0-12  2019-04-27                       2019-04-27
 8 G     0-12  2019-04-24                       2019-04-24
 9 G     0-12  2019-04-25                       2019-04-25
10 I     0-12  2019-04-26                       2019-04-29
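
A possible extension of the answer above, offered as a sketch rather than a tested solution for the real data: it keeps cargo rows untouched, collapses fishing rows to one per vessel per day, and sums the time in zone. The `vessels` data frame, its column names (`first_seen`, `last_seen`, `time_in_zone`), the helper `to_minutes()` and the `visit` column are all made up for illustration, and the values only mimic the layout of the question's data. It assumes dplyr 1.0.0 or later (for `.groups`) and that all times are in UTC. Note that `as.Date()` on POSIXct values defaulted to UTC in R at the time of the answer, which appears to be why vessel E shows two dates in the result above rather than the three the question expects; passing the local time zone through `tz` should split the days as intended.

```r
library(dplyr)

# Illustrative stand-in for the question's data (not the real records)
vessels <- data.frame(
  Name = c("A", "A", "C", "C", "E", "E"),
  Type = c("Fishing", "Fishing", "Cargo", "Cargo", "Fishing", "Fishing"),
  first_seen = as.POSIXct(c("2019-04-27 12:16", "2019-04-27 14:34",
                            "2019-04-26 06:35", "2019-04-26 08:37",
                            "2019-04-25 12:42", "2019-04-26 19:39"),
                          tz = "UTC"),
  last_seen = c("4/27/2019 12:29", "4/27/2019 14:42",
                "4/26/2019 07:17", "4/26/2019 17:12",
                "4/25/2019 12:47", "4/26/2019 19:54"),
  time_in_zone = c("13m", "8m", "42m", "8h 35m", "5m", "15m"),
  DOS = "0-12",
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn strings such as "1d 2h 14m" into minutes ("" gives 0)
to_minutes <- function(x) {
  vapply(strsplit(as.character(x), " "), function(parts) {
    n <- as.numeric(sub("[dhm]$", "", parts))   # numeric part of each token
    u <- sub("^[0-9]+", "", parts)              # unit letter of each token
    sum(n * c(d = 1440, h = 60, m = 1)[u])
  }, numeric(1))
}

collapsed <- vessels %>%
  mutate(
    day = as.Date(first_seen, tz = "UTC"),   # assumption: use your local zone here
    last_seen = as.POSIXct(last_seen, format = "%m/%d/%Y %H:%M", tz = "UTC"),
    minutes = to_minutes(time_in_zone),
    # Fishing rows share one group per vessel and day; every other row stays on its own
    visit = if_else(Type == "Fishing", 0L, row_number())
  ) %>%
  group_by(Name, Type, DOS, day, visit) %>%
  summarise(
    first_seen = min(first_seen),
    last_seen = max(last_seen),
    minutes_in_zone = sum(minutes),
    .groups = "drop"
  ) %>%
  select(-visit)

print(collapsed)
```

With the stand-in data, vessel A collapses to a single row with 21 minutes in the zone, vessel C keeps both cargo rows, and vessel E keeps one row for each of its two days. For the real `df`, the backticked column names from the question would replace the illustrative ones.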