R: A better way to split this sample

Date: 2015-04-14 13:19:14

Tags: r data.table

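I'm a beginner in R, and everything I do follows the typical approaches I learned in other languages. But whenever I look for R-related answers here, the code is structured completely differently from what I expect.

I have a data.table containing panel data on individuals. I want to look at the average outcome for one characteristic and then split the sample in two: those whose average outcome is above the median of the averages, and those whose average outcome is below it.

Here is the structure of my data.table, yearly: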

       user     wage year
1: 65122111     9.74 2003
2: 65122111     7.85 2004
3: 65122111    97.16 2005
4: 65122111    48.22 2006
5: 65122111    91.24 2007
6: 65122111     9.35 2008
7: 65122112    80.00 2007
8: 65122112     0.00 2008

This is what I have done:

## get mean wages
meanWages <- yearly[, list(meanWage = mean(wage)), by=(user)]
## split by median
highWage <- meanWages[meanWage > median(meanWages[, meanWage]), user]
lowWage <- meanWages[meanWage < median(meanWages[, meanWage]), user]
## split original sample
yearlyHigh <- yearly[is.element(user,highWage),]
yearlyLow <- yearly[is.element(user,lowWage),]

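I think this gives me what I expect (checking correctness is quite cumbersome), but it seems very clunky and inefficient. What is a more efficient and compact way to do the same thing?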

2 answers:

Answer 0 (score: 3)

You could try the following, though I can't say for sure that it is the most efficient or the most compact way.

yearly[, meanwage := mean(wage), by=user]
yearlyHigh <- yearly[meanwage >= median(meanwage)]
yearlyLow <- yearly[meanwage < median(meanwage)]
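One caveat: median(meanwage) here is taken over all rows, so users with more yearly observations carry more weight than in the question's code, which takes the median over one mean per user. If the user-level median is what is wanted, a minimal sketch could look like the following. It rebuilds the sample rows from the question, and the name cutoff is only illustrative.

```r
library(data.table)

# Sample rows copied from the question
yearly <- data.table(
  user = c(rep(65122111, 6), rep(65122112, 2)),
  wage = c(9.74, 7.85, 97.16, 48.22, 91.24, 9.35, 80.00, 0.00),
  year = c(2003:2008, 2007:2008)
)

# Per-user mean wage, added to every row by reference
yearly[, meanwage := mean(wage), by = user]

# Median over one value per user rather than over rows
# (`cutoff` is an illustrative name, not from the original answer)
cutoff <- yearly[, .(meanwage = meanwage[1]), by = user][, median(meanwage)]

yearlyHigh <- yearly[meanwage >= cutoff]
yearlyLow  <- yearly[meanwage <  cutoff]
```

Because one side uses >= and the other <, every row lands in exactly one of the two tables, which makes a quick sanity check easy: the row counts of yearlyHigh and yearlyLow should add up to nrow(yearly).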

Answer 1 (score: 3)

You can also use the dplyr package. It is probably less efficient, but it is easy to read.

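library(dplyr)

yearly %>% 
  group_by(user) %>% 
  mutate(meanwage = mean(wage)) %>% 
  ungroup() %>%  # drop the grouping so the median is taken over the whole sample, not within each user
  filter(meanwage >= median(meanwage))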

Actually splitting the data is rarely useful. Just group by wage category and use grouped operations.

yearly %>% 
  group_by(user) %>%
  mutate(meanwage = mean(wage)) %>%
  ungroup() %>%
  mutate(cat = ifelse(meanwage >= median(meanwage), "high", "low")) %>%
  group_by(cat) %>%
  do(data.table("further analyses here ..."))

Or just use data.table:

yearly[, meanwage := mean(wage), by=user]
yearly[, cat := ifelse(meanwage >= median(meanwage), "high", "low")]
yearly[, "further analyses here ...", by = cat]
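As a concrete stand-in for the "further analyses" placeholder, here is a minimal sketch that summarises each wage category in one grouped call. The summary columns n_users and avg_wage and the name summary_by_cat are illustrative choices, and the data are the sample rows from the question.

```r
library(data.table)

# Sample rows copied from the question
yearly <- data.table(
  user = c(rep(65122111, 6), rep(65122112, 2)),
  wage = c(9.74, 7.85, 97.16, 48.22, 91.24, 9.35, 80.00, 0.00),
  year = c(2003:2008, 2007:2008)
)

# Per-user mean wage, then a high/low label relative to the row-level median
yearly[, meanwage := mean(wage), by = user]
yearly[, cat := ifelse(meanwage >= median(meanwage), "high", "low")]

# One grouped call replaces two separate tables:
# number of distinct users and average wage in each category
summary_by_cat <- yearly[, .(n_users = uniqueN(user), avg_wage = mean(wage)), by = cat]
print(summary_by_cat)
```

Keeping everything in one table with a cat column means any later analysis only needs by = cat, instead of being repeated once for yearlyHigh and once for yearlyLow.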