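I have multiple sites, each visited several times. I want to subset the data so that it includes only one visit per site (but all observations from that visit), and I want that visit to be the one closest in time to the median date of all visits across all sites.
Example data: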
d = data.table(site = c('a', 'a','a','a','b', 'b','b', 'b', 'c', 'c', 'c', 'c'),
sex = c('m','f','m','f','m','f','m','f','m','f','m','f'),
date = c(127,127, 185, 185, 132,132, 189,189, 119,119, 178, 178),
count = c(12, 15, 10, 9, 18, 22,12, 15, 10, 9, 18, 22))
What I want to get:
d = data.table(site = c('a', 'a','b', 'b', 'c', 'c'),
sex = c('m','f','m','f','m','f'),
date = c(127,127, 132,132, 178, 178),
count = c(12, 15,18, 22, 18, 22))
Answer 0 (score: 1)
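library(data.table)
# Simplified data: one row per visit (no sex column)
d = data.table(site = c('a', 'a', 'b', 'b', 'c', 'c'),
               date = c(127, 185, 132, 189, 119, 178),
               count = c(12, 15, 10, 9, 18, 22))
# Median date across all visits (155 here)
d.median = d[, median(date)]
# Within each site, keep the visit whose date is closest to the overall median
d[, {i = which.min(abs(date - d.median))
     list(date = date[i], count = count[i])},
  by = site]
# This d has no sex column, so by = list(sex, site) would fail on it;
# on the question's data, by = list(sex, site) returns one row per sex and site.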
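To keep every observation from the chosen visit (both sexes in the question's data), one option is to filter rows within each site instead of picking a single index. This is a minimal sketch, assuming the question's 12-row data and taking the median over distinct (site, date) visits; the names visits and res are illustrative and not part of the original answer.

```r
library(data.table)

# The question's example data: two visits per site, two rows (m/f) per visit
d <- data.table(
  site  = c('a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'),
  sex   = c('m', 'f', 'm', 'f', 'm', 'f', 'm', 'f', 'm', 'f', 'm', 'f'),
  date  = c(127, 127, 185, 185, 132, 132, 189, 189, 119, 119, 178, 178),
  count = c(12, 15, 10, 9, 18, 22, 12, 15, 10, 9, 18, 22)
)

# One row per visit, so a visit with more observations does not weigh more
visits <- unique(d[, .(site, date)])
d.median <- visits[, median(date)]

# Within each site, keep all rows whose date is nearest to the overall median
res <- d[, .SD[abs(date - d.median) == min(abs(date - d.median))], by = site]
print(res)
```

If two visits to the same site are equally close to the median, this keeps both of them, so a tie-break rule would still be needed in that case.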
Answer 1 (score: 1)
Here is one approach using ave and rank from base R:
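# mydf is not defined in the answer; this definition is an assumption
# that reproduces the output shown below
mydf <- data.frame(site = c('a', 'a', 'b', 'b', 'c', 'c'),
                   date = c(127, 185, 132, 189, 119, 178),
                   count = c(12, 15, 10, 9, 18, 22))
# Rank each visit within its site by distance from the overall median date
myRanks <- with(mydf, ave(date, site, FUN = function(x)
  rank(abs(x - median(date)), ties.method = "first")))
mydf[myRanks == 1, ]
#   site date count
# 1    a  127    12
# 3    b  132    10
# 6    c  178    22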
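rank is used here to handle the case where more than one value is "closest" to the median: with ties.method = "first", only the first of the tied rows gets rank 1 and is kept.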
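Because ties.method = "first" keeps a single row per site, it would drop the second observation of a visit when the data has several rows per visit, as in the question. A base R sketch that keeps all rows of the nearest visit is shown here; it assumes the question's 12-row data, and the names med, dist, keep and result are illustrative.

```r
# The question's example data as a plain data.frame
mydf <- data.frame(
  site  = c('a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'),
  sex   = c('m', 'f', 'm', 'f', 'm', 'f', 'm', 'f', 'm', 'f', 'm', 'f'),
  date  = c(127, 127, 185, 185, 132, 132, 189, 189, 119, 119, 178, 178),
  count = c(12, 15, 10, 9, 18, 22, 12, 15, 10, 9, 18, 22)
)

# Median over distinct visits (one date per site and visit)
med <- median(unique(mydf[, c("site", "date")])$date)

# Distance of every row from the median, and the smallest distance per site
dist <- abs(mydf$date - med)
keep <- dist == ave(dist, mydf$site, FUN = min)

# All rows belonging to the nearest visit of each site
result <- mydf[keep, ]
print(result)
```

Comparing against the per-site minimum rather than ranking means every row of the nearest visit is retained; it also means two different visits at exactly the same distance would both be kept.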