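I'm a beginner in R, and everything I do comes from the typical approaches I learned in other languages. But whenever I look for R-related answers here, the code is structured completely differently from what I expect.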
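I have a data.table containing panel data on individuals. I want to look at the mean outcome of one characteristic and then split the sample in two: the users whose mean outcome is above the median of the mean outcomes, and the users whose mean outcome is below it.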
Here is the structure of my data.table, yearly:
user wage year
1: 65122111 9.74 2003
2: 65122111 7.85 2004
3: 65122111 97.16 2005
4: 65122111 48.22 2006
5: 65122111 91.24 2007
6: 65122111 9.35 2008
7: 65122112 80.00 2007
8: 65122112 0.00 2008
This is what I did:
## get mean wages
meanWages <- yearly[, list(meanWage = mean(wage)), by=(user)]
## split by median
highWage <- meanWages[meanWage > median(meanWages[, meanWage]), user]
lowWage <- meanWages[meanWage < median(meanWages[, meanWage]), user]
## split original sample
yearlyHigh <- yearly[is.element(user,highWage),]
yearlyLow <- yearly[is.element(user,lowWage),]
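I think this gives me what I want (checking correctness is very tedious), but it seems clunky and inefficient. What is a more efficient and concise way to do the same thing?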
Answer 0: (score: 3)
You can try the following, though I can't be sure it is the most efficient or the most compact approach.
yearly[, meanwage := mean(wage), by=user]
yearlyHigh <- yearly[meanwage >= median(meanwage)]
yearlyLow <- yearly[meanwage < median(meanwage)]
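One thing to note: median(meanwage) in the snippet above is taken over all rows, so users with more yearly observations weigh more, whereas the question computed the median over one mean per user. A minimal sketch of the per-user variant, assuming the sample rows from the question (the name cutoff is only illustrative):

```r
library(data.table)

# Sample rows copied from the question
yearly <- data.table(
  user = c(rep(65122111, 6), rep(65122112, 2)),
  wage = c(9.74, 7.85, 97.16, 48.22, 91.24, 9.35, 80.00, 0.00),
  year = c(2003:2008, 2007:2008)
)

# Per-user mean wage, added as a column by reference
yearly[, meanwage := mean(wage), by = user]

# Median over one value per user, so users with many rows do not count more
cutoff <- median(unique(yearly, by = "user")$meanwage)

# Split the original rows on the per-user cutoff
yearlyHigh <- yearly[meanwage >= cutoff]
yearlyLow  <- yearly[meanwage <  cutoff]
```

On these eight rows both variants give the same split (six rows in yearlyHigh, two in yearlyLow); they can differ when the panel is unbalanced.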
Answer 1: (score: 3)
You can also use the dplyr package. It may be less efficient, but it is easy to read.
library(dplyr)
yearly %>%
  group_by(user) %>%
  mutate(meanwage = mean(wage)) %>%
  ungroup() %>%  # drop the grouping so the median is taken over the whole sample
  filter(meanwage >= median(meanwage))
Actually splitting the data is rarely useful. Just add a wage category and then use grouped operations.
yearly %>%
  group_by(user) %>%
  mutate(meanwage = mean(wage)) %>%
  ungroup() %>%
  mutate(cat = ifelse(meanwage >= median(meanwage), "high", "low")) %>%
  group_by(cat) %>%
  do(data.table("further analyses here ..."))
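As a rough illustration of what the placeholder step might look like, here is a minimal sketch assuming the sample rows from the question. The summary itself (number of users and average wage per category) and the name wage_summary are hypothetical, and it uses summarise() in place of do():

```r
library(dplyr)

# Sample rows copied from the question
yearly <- data.frame(
  user = c(rep(65122111, 6), rep(65122112, 2)),
  wage = c(9.74, 7.85, 97.16, 48.22, 91.24, 9.35, 80.00, 0.00),
  year = c(2003:2008, 2007:2008)
)

wage_summary <- yearly %>%
  group_by(user) %>%
  mutate(meanwage = mean(wage)) %>%   # per-user mean wage
  ungroup() %>%                       # median below is over all rows
  mutate(cat = ifelse(meanwage >= median(meanwage), "high", "low")) %>%
  group_by(cat) %>%
  summarise(n_users = n_distinct(user), avg_wage = mean(wage))
```

The result has one row per wage category, so no separate high and low copies of the data are needed.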
Or use data.table alone:
yearly[, meanwage := mean(wage), by=user]
yearly[, cat := ifelse(meanwage >= median(meanwage), "high", "low")]
yearly[, "further analyses here ...", by = cat]
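To make the last line concrete, a minimal sketch assuming the sample rows from the question; the summary columns n_users and avg_wage are hypothetical stand-ins for whatever analysis is needed:

```r
library(data.table)

# Sample rows copied from the question
yearly <- data.table(
  user = c(rep(65122111, 6), rep(65122112, 2)),
  wage = c(9.74, 7.85, 97.16, 48.22, 91.24, 9.35, 80.00, 0.00),
  year = c(2003:2008, 2007:2008)
)

# Per-user mean wage, then a high/low label relative to the median
yearly[, meanwage := mean(wage), by = user]
yearly[, cat := ifelse(meanwage >= median(meanwage), "high", "low")]

# Example grouped analysis: distinct users and average wage per category
summary_by_cat <- yearly[, .(n_users = uniqueN(user), avg_wage = mean(wage)), by = cat]
```

Because := modifies yearly by reference, the meanwage and cat columns stay on the original table and can be reused for any later grouped step.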