Optimising a complex grouped mutate that depends on the row (lookup?)

Date: 2016-11-08 15:42:46

Tags: r dplyr

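I keep running into this problem more and more often (in variations). I suspect there is a more efficient approach and would appreciate some pointers.

The toy example I created below does not take that long, but when I use several such lookup functions on my real data, it can take much longer. Basically, the aim is to count the siblings who fulfil multiple conditions. Because it depends on when each individual was alive, the result is different for each sibling.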

library(dplyr)
# sample data
sibs = tbl_df(data.frame(survive1y = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 
1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 
1, 1, 0, 0, 1, 0, 1), byear = c(1717L, 1719L, 1721L, 1723L, 1724L, 
1725L, 1727L, 1728L, 1730L, 1732L, 1733L, 1735L, 1736L, 1738L, 
1740L, 1740L, 1742L, 1738L, 1744L, 1746L, 1748L, 1749L, 1753L, 
1755L, 1757L, 1758L, 1759L, 1761L, 1762L, 1764L, 1767L, 1717L, 
1719L, 1721L, 1786L, 1773L, 1767L, 1768L, 1792L), dyear = c(1748L, 
1791L, 1760L, 1795L, 1765L, 1756L, 1730L, 1733L, 1733L, 1732L, 
1755L, 1800L, 1736L, 1738L, 1740L, 1740L, 1761L, 1816L, 1744L, 
1748L, 1748L, 1749L, 1754L, 1756L, 1757L, 1759L, 1815L, 1761L, 
1765L, 1783L, 1768L, 1800L, 1750L, 1757L, 1786L, 1773L, 1769L, 
1768L, 1793L)))
sibs = bind_rows(replicate(10000, sibs, simplify = FALSE))
sibs$idParents = rep(1:(nrow(sibs)/10), each = 10, length.out = nrow(sibs))

# get the number of siblings who were alive and dependent 
# in the first five years of this individual
dependent_sibs_f5y = function(survive1y, byear, dyear) {
    sibs = length(byear)
    other_dependent_sibs_f5y = integer(length=sibs)
    for (i in seq_len(sibs)) {  # seq_len() is safe even for an empty group
        # remove this sib
        other_births = byear[-i]
        other_deaths = dyear[-i]
        other_made1y = survive1y[-i]
        my_sibs = sibs - 1 - # minus self
            sum(
                other_births > (byear[i] + 5) | # born more than 5y later
                (other_births + 5) < byear[i] | # finished infancy before birth
                other_deaths <= byear[i] | # died before birth
                other_made1y == 0, # if they died right away, don't count
            na.rm = TRUE)  # if dyear missing assume they lived
        other_dependent_sibs_f5y[i] = my_sibs
    }
    other_dependent_sibs_f5y
}

system.time({
sibs2 = sibs %>%
   group_by(idParents) %>%
   mutate(
       dependent_sibs_f5y = 
       dependent_sibs_f5y(survive1y=survive1y, byear=byear, dyear=dyear)
   )
 })
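
One possible direction, offered only as a sketch and not benchmarked here: the per-sibling loop can be replaced by pairwise comparisons built with `outer()`, so that all siblings in a family are compared at once. The name `dependent_sibs_f5y_vec` is hypothetical; the conditions are copied from the loop version above. Whether this is actually faster on real data would need to be measured.

```r
# Vectorised sketch of the same count (hypothetical name, same conditions)
dependent_sibs_f5y_vec <- function(survive1y, byear, dyear) {
  n <- length(byear)
  # excluded[j, i] is TRUE when sibling j does NOT count for focal child i
  excluded <- outer(byear, byear + 5, ">") |    # j born more than 5y after i
    outer(byear + 5, byear, "<") |              # j finished infancy before i was born
    outer(dyear, byear, "<=") |                 # j died before i was born
    matrix(survive1y == 0, nrow = n, ncol = n)  # j died in the first year
  diag(excluded) <- TRUE                        # never count the focal child itself
  # NA cells (e.g. missing dyear) are dropped, i.e. the sibling is assumed alive
  n - colSums(excluded, na.rm = TRUE)
}

# Small made-up family of three, one with a missing death year
dependent_sibs_f5y_vec(
  survive1y = c(1, 0, 1),
  byear     = c(1717, 1719, 1721),
  dyear     = c(1748, 1718, NA)
)
# 1 2 1
```

It can be dropped into the same `group_by(idParents) %>% mutate(...)` call in place of the loop version.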

2 Answers:

Answer 0 (score: 1)

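It turns out that my approach is not that slow once the functions that override dplyr's namespace are loaded before dplyr (which I had accidentally not done because of a load-order mix-up). I only figured that out by making this reproducible example, sorry for wasting your time. There may be faster solutions using approaches optimised for time series, but this one works fine.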
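
For anyone hitting the same thing, here is a rough way to check for masking. The answer does not say which package was involved, so this is only a generic sketch (plyr, for example, also exports a `mutate`). The data frame `df` is made up for illustration.

```r
library(dplyr)

# Which package does the bare name `mutate` currently resolve to?
# If another package was attached after dplyr and masks it, this will not say "dplyr".
environmentName(environment(mutate))

# Calling the verbs with an explicit namespace avoids depending on load order
df <- data.frame(g = c(1, 1, 2), x = 1:3)
res <- df %>%
  dplyr::group_by(g) %>%
  dplyr::mutate(group_total = sum(x))
```

Base R's `conflicts()` also lists names that are defined in more than one attached package.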

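Answer 1 (score: -1)

In general, when you want to count or aggregate the occurrences of specific events within groups, a fast and robust strategy is to do the following:

  1. Create dummy variables for each event, i.e. a variable that equals one if the condition is met and zero otherwise
  2. Use group_by and sum the dummy variables by group. Summing the dummies gives the count; averaging them gives the mean (or probability)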
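
A minimal sketch of these two steps, using a made-up data frame `toy` (the column names follow the question, the values are invented):

```r
library(dplyr)

# Hypothetical toy data: two families with a few children each
toy <- data.frame(
  idParents = c(1, 1, 1, 2, 2),
  survive1y = c(1, 0, 1, 1, 1)
)

counts <- toy %>%
  mutate(made1y = as.integer(survive1y == 1)) %>%  # step 1: dummy variable
  group_by(idParents) %>%                          # step 2: aggregate by group
  summarise(
    n_made1y     = sum(made1y),   # sum of dummies = count
    share_made1y = mean(made1y)   # mean of dummies = proportion
  )
```

Note that this works directly when the condition is fixed per row. In the question the condition depends on the focal sibling's birth year, so each row would need its own set of dummies (a pairwise comparison within the family) before summing.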