Randomly sampling a subset of a dataframe variable

Time: 2011-12-07 11:21:08

Tags: r plyr

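I am working with a large data set containing travel behaviour data collected over one week. During that week, people kept a diary of the individual trips they made. Each person is identified by a unique identification number (ID). What I want to do is select two days of diary data (each of which may contain one or more trips) from the week of data available for each unique ID, and put them into a new data frame. An example data frame is given below: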

Df1 <- data.frame(ID = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3), 
                  date = c("1st Nov", "1st Nov", "3rd Nov", "4th Nov","4th Nov","5th Nov","2nd Nov", "2nd Nov", "3nd Nov", "4th Nov","5th Nov","5th Nov","2nd Nov", "2nd Nov", "3nd Nov", "4th Nov","5th Nov"))

Any help with the above would be appreciated.

Many thanks,

Katie

1 Answer:

Answer 0: (score: 8)

Sounds like a job for plyr. To sample two random days for each user:

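library(plyr)
# Split Df1 by ID, apply the function to each piece, and recombine the results
ddply(Df1, .(ID), function(x) {
  # Distinct diary days recorded for this ID
  unique_days <- as.character(unique(x$date))
  if (length(unique_days) < 2) {
    # Fewer than two days available: keep whatever there is
    randomSelDays <- unique_days
  } else {
    # Draw two distinct days at random (without replacement)
    randomSelDays <- sample(unique_days, 2)
  }
  # Keep every trip that falls on one of the selected days
  return(x[x$date %in% randomSelDays, ])
})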

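This returns all the rows for the two selected days for each unique ID. If an ID has only one day of data, that day is returned. Because the days are drawn at random, the result differs from run to run. One possible result: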

  ID    date
1  1 1st Nov
2  1 1st Nov
3  1 3rd Nov
4  2 3nd Nov
5  2 5th Nov
6  2 5th Nov
7  3 2nd Nov
8  3 2nd Nov
9  3 3nd Nov
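
If you would rather not depend on plyr, or you need the same two days to come out on every run, something along these lines might work. This is only a sketch in base R, not part of the answer above: the helper name sample_days, its n_days argument, the result name Df2 and the seed value 42 are all made up for illustration. The date labels are copied exactly as posted in the question.

```r
# Example data from the question (date labels kept exactly as posted).
Df1 <- data.frame(
  ID = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3),
  date = c("1st Nov", "1st Nov", "3rd Nov", "4th Nov", "4th Nov", "5th Nov",
           "2nd Nov", "2nd Nov", "3nd Nov", "4th Nov", "5th Nov", "5th Nov",
           "2nd Nov", "2nd Nov", "3nd Nov", "4th Nov", "5th Nov"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from the original answer): for one person's rows x,
# keep every row belonging to up to n_days randomly chosen days.
sample_days <- function(x, n_days = 2) {
  unique_days <- unique(as.character(x$date))
  # Sample positions rather than values, and never ask for more days
  # than this person actually has.
  n_keep <- min(n_days, length(unique_days))
  keep <- unique_days[sample.int(length(unique_days), n_keep)]
  x[x$date %in% keep, , drop = FALSE]
}

set.seed(42)  # arbitrary seed, assumed here only to make the draw repeatable

# Split by ID, sample within each piece, then stack the pieces back together.
pieces <- lapply(split(Df1, Df1$ID), sample_days)
Df2 <- do.call(rbind, pieces)
rownames(Df2) <- NULL

# Number of distinct days kept per ID (each ID in this example has at least two).
days_per_id <- tapply(Df2$date, Df2$ID, function(d) length(unique(d)))
print(days_per_id)
```

Calling set.seed() before the ddply() call in the answer should make that version repeatable in the same way. Using min() in the helper replaces the explicit if/else: an ID with a single day simply keeps that day.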