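I have a large dataset whose data points are collected sporadically over time. It is essentially GPS tracking data, collected whenever a receiving antenna picks it up. Sometimes the resolution is too high, e.g. a fix every minute or so, which is unnecessary and makes mapping the data a processing challenge, so I would like to thin it out.
The best approach I can think of is to filter the data so that date-times are unique to the hour, which would cut down the number of minute-level data points. However, it also has to be done per individual identifier, in this case "NAME", because some date-times may overlap across objects with different names.
I'm not particularly fussed about which row is chosen for each hour, and there is no need to average them or anything like that. Any thoughts on the best way to do this?
Here is some dummy data: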
df <- structure(list(`Local Time` = structure(c(1559388960, 1559389200,
1559394840, 1559397180, 1559397900, 1559398380, 1559398560, 1559398680,
1559398740, 1559398800, 1559399160, 1559399280, 1559399400, 1559399580,
1559399640, 1559399820, 1559399940, 1559400120, 1559400240, 1559400780,
1559400840, 1559400960, 1559401080, 1559401260, 1559401380, 1559383560,
1559389200, 1559389440, 1559395080, 1559395320, 1559397180, 1559397900,
1559398200, 1559398440, 1559398680, 1559398920, 1559399220, 1559399520,
1559399820, 1559400120, 1559400360, 1559400660, 1559400960, 1559401200,
1559401500, 1559401740, 1559402040, 1559402280, 1559402580, 1559402880
), class = c("POSIXct", "POSIXt"), tzone = ""), COG = c(315,
352.6, 265.6, 214.9, 240.8, 245.5, 240.3, 250.5, 262.4, 269.8,
281.1, 262.9, 253.1, 247.7, 255.5, 249.4, 263.2, 268.6, 279.6,
274.3, 254.6, 246.6, 253.7, 242.3, 163.5, 90, 88, 89, 93, 96,
95, 97, 97, 98, 98, 95, 93, 94, 92, 91, 91, 91, 91, 90, 90, 92,
89, 89, 89, 88), NAME = c("Aur", "Aur", "Aur", "Aur", "Aur",
"Aur", "Aur", "Aur", "Aur", "Aur", "Aur", "Aur", "Aur", "Aur",
"Aur", "Aur", "Aur", "Aur", "Aur", "Aur", "Aur", "Aur", "Aur",
"Aur", "Aur", "Cos", "Cos", "Cos", "Cos", "Cos", "Cos", "Cos",
"Cos", "Cos", "Cos", "Cos", "Cos", "Cos", "Cos", "Cos", "Cos",
"Cos", "Cos", "Cos", "Cos", "Cos", "Cos", "Cos", "Cos", "Cos"
)), row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"))
Answer 0 (score: 3)
Use round.POSIXt (and as.POSIXct, since the former returns POSIXlt, which dplyr doesn't like):
library(dplyr)
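df %>%
  # group by name and by the timestamp rounded to the nearest hour
  group_by(NAME, rtime = as.POSIXct(round.POSIXt(`Local Time`, units = "hours"))) %>%
  # keep the first row of each group
  slice(1)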
# # A tibble: 9 x 4
# # Groups: NAME, rtime [9]
# `Local Time` COG NAME rtime
# <dttm> <dbl> <chr> <dttm>
# 1 2019-06-01 04:36:00 315 Aur 2019-06-01 05:00:00
# 2 2019-06-01 06:14:00 266. Aur 2019-06-01 06:00:00
# 3 2019-06-01 06:53:00 215. Aur 2019-06-01 07:00:00
# 4 2019-06-01 07:30:00 253. Aur 2019-06-01 08:00:00
# 5 2019-06-01 03:06:00 90 Cos 2019-06-01 03:00:00
# 6 2019-06-01 04:40:00 88 Cos 2019-06-01 05:00:00
# 7 2019-06-01 06:18:00 93 Cos 2019-06-01 06:00:00
# 8 2019-06-01 06:53:00 95 Cos 2019-06-01 07:00:00
# 9 2019-06-01 07:32:00 94 Cos 2019-06-01 08:00:00
If you prefer, you can use slice(n()) instead to return the last row of each group, or sample_n(1) to return a random row.
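Note that rounding assigns each fix to the nearest hour, so 06:53 ends up in the 07:00 group. If buckets should follow the clock hour instead, truncating is one alternative. The following is a minimal base R sketch comparing the two on two made-up timestamps (the vector x is illustrative, not taken from the question's data):

```r
# Two illustrative timestamps in the same clock hour (made up for this sketch)
x <- as.POSIXct(c("2019-06-01 06:14:00", "2019-06-01 06:53:00"), tz = "UTC")

# round() goes to the nearest hour; trunc() drops minutes and seconds.
# Both return POSIXlt, so convert back to POSIXct as in the answer above.
rounded <- as.POSIXct(round(x, "hours"))
floored <- as.POSIXct(trunc(x, "hours"))

format(rounded, "%H:%M")  # "06:00" "07:00"
format(floored, "%H:%M")  # "06:00" "06:00"
```

Swapping trunc for round in the group_by() call above should then group by clock hour rather than by nearest hour.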
Answer 1 (score: 1)
This can also be done in data.table; given the size of your dataset, I think this will save you some computing resources:
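library(data.table)
# df is the data frame from the question; setDT() converts it to a data.table by reference
# Keep the first row (.SD[1]) per NAME and hour; substr(..., 1, 13) keeps the "YYYY-MM-DD HH" prefix
setDT(df)[, .SD[1], by = list(NAME, DateTime = substr(`Local Time`, 1, 13))]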
# NAME DateTime COG
# 1: Aur 2019-06-01 07 315.0
# 2: Aur 2019-06-01 09 265.6
# 3: Aur 2019-06-01 10 240.8
# 4: Aur 2019-06-01 11 242.3
# 5: Cos 2019-06-01 06 90.0
# 6: Cos 2019-06-01 07 88.0
# 7: Cos 2019-06-01 09 93.0
# 8: Cos 2019-06-01 10 97.0
# 9: Cos 2019-06-01 11 90.0
You can also use .SD[.N] to get the last row.
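For comparison, the same "one row per name per hour" idea can be sketched in base R without any packages, by building a name-plus-hour key and dropping duplicates. This is only an illustration under stated assumptions: the toy data frame and the names time, hour_key, first_per_hour and last_per_hour are invented here and are not part of the question or the answers.

```r
# Toy data (invented for illustration, not the question's dataset)
toy <- data.frame(
  time = as.POSIXct(c("2019-06-01 04:36:00", "2019-06-01 04:40:00",
                      "2019-06-01 05:10:00", "2019-06-01 04:40:00"), tz = "UTC"),
  COG  = c(315, 352.6, 265.6, 88),
  NAME = c("Aur", "Aur", "Aur", "Cos")
)

# Key each row by tracker name plus the clock hour of the fix
hour_key <- paste(toy$NAME, format(toy$time, "%Y-%m-%d %H"))

# Keep the first row seen for each key
first_per_hour <- toy[!duplicated(hour_key), ]

# Keep the last row for each key instead
last_per_hour <- toy[!duplicated(hour_key, fromLast = TRUE), ]
```

Here first_per_hour keeps three rows (Aur at 04:36, Aur at 05:10, Cos at 04:40), while last_per_hour keeps Aur at 04:40 in place of Aur at 04:36.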