Subset observations by group, keeping the visit closest to the median of the whole sample

Time: 2013-04-10 19:59:39

Tags: r subset

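I have multiple sites, each visited multiple times. I want to subset the data so that it includes only one visit per site (but all observations from that visit), and I want that visit to be the one closest in time to the median date of all visits across all sites.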

Example data:

library(data.table)

d = data.table(site = c('a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'),
               sex = c('m', 'f', 'm', 'f', 'm', 'f', 'm', 'f', 'm', 'f', 'm', 'f'),
               date = c(127, 127, 185, 185, 132, 132, 189, 189, 119, 119, 178, 178),
               count = c(12, 15, 10, 9, 18, 22, 12, 15, 10, 9, 18, 22))

What I want to get:

d = data.table(site = c('a', 'a', 'b', 'b', 'c', 'c'),
               sex = c('m', 'f', 'm', 'f', 'm', 'f'),
               date = c(127, 127, 132, 132, 178, 178),
               count = c(12, 15, 18, 22, 18, 22))

2 Answers:

Answer 0: (score: 1)

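library(data.table)

# One row per site visit (this reduced example has no sex column)
d = data.table(site = c('a', 'a', 'b', 'b', 'c', 'c'),
               date = c(127, 185, 132, 189, 119, 178),
               count = c(12, 15, 10, 9, 18, 22))

# Median date across all visits at all sites
d.median = d[, median(date)]

# Within each site, keep the visit whose date is closest to the overall median.
# With the question's data, which also has a sex column, group by list(sex, site).
d[, {i = which.min(abs(date - d.median))
     list(date = date[i], count = count[i])},
  by = site]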
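
The question also asks to keep all observations from the chosen visit. Below is a minimal sketch of one way to do that, assuming the four-column data from the question (including the sex column). The name res is introduced here only for illustration.

```r
library(data.table)

# The question's example data: two rows (one per sex) for each site visit
d <- data.table(site = rep(c("a", "b", "c"), each = 4),
                sex = rep(c("m", "f"), times = 6),
                date = c(127, 127, 185, 185, 132, 132, 189, 189, 119, 119, 178, 178),
                count = c(12, 15, 10, 9, 18, 22, 12, 15, 10, 9, 18, 22))

# Median date across all visits at all sites
d.median <- d[, median(date)]

# Within each site, find the single date closest to the overall median
# (which.min returns the first minimum), then keep every row with that date
res <- d[, .SD[date == date[which.min(abs(date - d.median))]], by = site]
res
```

Because which.min picks one date per site, two visits that happen to be equally far from the median would not both be kept; only the one appearing first in the data would.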

Answer 1: (score: 1)

Here is one approach using ave and rank from base R:

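# Assumed here: mydf holds one row per site visit, matching the output shown below
mydf <- data.frame(site = c("a", "a", "b", "b", "c", "c"),
                   date = c(127, 185, 132, 189, 119, 178),
                   count = c(12, 15, 10, 9, 18, 22))

# Rank visits within each site by distance from the overall median of mydf$date
myRanks <- with(mydf, ave(date, site, FUN = function(x) 
  rank(abs(x - median(date)), ties.method = "first")))
mydf[myRanks == 1, ]
#   site date count
# 1    a  127    12
# 3    b  132    10
# 6    c  178    22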

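rank is used here to handle cases where more than one value is equally "closest" to the median; with ties.method = "first", only the first of the tied values gets rank 1.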
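
The tie-breaking choice matters for the question's original data, where each visit contributes two rows (one per sex) with the same date. The sketch below compares ties.method = "first" with ties.method = "min" on that data. The names mydf, overall_median, first_ranks, min_ranks, one_row_per_site, and whole_visit are assumptions made for this illustration.

```r
# The question's example data as a plain data.frame
mydf <- data.frame(site = rep(c("a", "b", "c"), each = 4),
                   sex = rep(c("m", "f"), times = 6),
                   date = c(127, 127, 185, 185, 132, 132, 189, 189, 119, 119, 178, 178),
                   count = c(12, 15, 10, 9, 18, 22, 12, 15, 10, 9, 18, 22))

# Median date across all visits at all sites
overall_median <- median(mydf$date)

# "first": tied rows get distinct ranks, so only one row per site has rank 1
first_ranks <- with(mydf, ave(date, site, FUN = function(x)
  rank(abs(x - overall_median), ties.method = "first")))

# "min": tied rows share the lowest rank, so both rows of the closest visit have rank 1
min_ranks <- with(mydf, ave(date, site, FUN = function(x)
  rank(abs(x - overall_median), ties.method = "min")))

one_row_per_site <- mydf[first_ranks == 1, ]
whole_visit <- mydf[min_ranks == 1, ]
whole_visit
```

With "min", whole_visit keeps both the m and f rows of the chosen visit. The trade-off is that two different visits exactly equidistant from the median would both be kept as well.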