Sliding a window over a data.frame with a nested hierarchy

Asked: 2015-09-01 02:54:03

Tags: r performance dataframe

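Data description

My data.frame represents the salaries (salary) of people living in different cities (city) of different countries (country). City names, country names and salaries are integers. In my data.frame, the variable country is sorted, the variable city is sorted within each country, and the variable salary is sorted within each city (and country). There are also two further columns, named arg1 and arg2, which contain floats/doubles.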

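Goal

For each country and each city, I want to consider a window of size WindowSize over the salaries and compute D = sum(arg1)/sum(arg2) over this window. The window should then slide by WindowStep and D should be recomputed, and so on. For example, take WindowSize = 1000 and WindowStep = 10. Within each country and each city, I would like to get D for the salary range 0 to 1000, then for 10 to 1010, then for 20 to 1020, and so on.

In the end, the output should be a data.frame that associates a D statistic with each window. If a given window has no entries (e.g. nobody in country 1, city 3 has a salary between 20 and 1020), the D statistic should be NA.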
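To make the window definition concrete, here is a minimal base-R sketch for a single country/city group. The helper name window_D and the toy vectors are made up for illustration, and the sketch assumes half-open windows [from, from + WindowSize), which the question does not actually specify.

```r
# Hypothetical helper: D for every sliding window over one group's salaries.
# Assumption: a window covers from <= salary < from + WindowSize.
window_D <- function(salary, arg1, arg2, WindowSize, WindowStep) {
  starts <- seq(0, max(salary), by = WindowStep)
  D <- vapply(starts, function(from) {
    inside <- salary >= from & salary < from + WindowSize
    if (!any(inside)) return(NA_real_)  # empty window gives NA, as requested
    sum(arg1[inside]) / sum(arg2[inside])
  }, numeric(1))
  data.frame(from = starts, to = starts + WindowSize, D = D)
}

# Toy group with three people; the windows starting at 20 and 30 are empty.
toy <- window_D(salary = c(5, 12, 40),
                arg1 = c(1, 2, 3), arg2 = c(2, 2, 1),
                WindowSize = 10, WindowStep = 10)
print(toy)
```

Applying such a helper per group (for example with split and lapply) gives the requested data.frame, although scanning all salaries for every window is not the fastest approach.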

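A note on performance

I have to run this algorithm about 10000 times on fairly large tables (which have nothing to do with countries, cities and salaries; I don't yet have a good estimate of their size), so performance is a concern.

Example data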

set.seed(84)
country = rep(1:3, c(30, 22, 51))
city = c(rep(1:5, c(5,5,5,5,10)), rep(1:5, c(1,1,10,8,2)), rep(c(1,3,4,5), c(20, 7, 3, 21)))
tt = paste0(city, country)
salary = c()
for (i in unique(tt)) salary = append(salary, sort(round(runif(sum(tt==i), 0,100000))))

arg1 = rnorm(length(country), 1, 1)
arg2 = rnorm(length(country), 1, 1)
dt = data.frame(country = country, city = city, salary = salary, arg1 = arg1, arg2 = arg2)
head(dt)
  country city salary       arg1        arg2
1       1    1  22791 -1.4606212  1.07084528
2       1    1  34598  0.9244679  1.19519158
3       1    1  76411  0.8288587  0.86737330
4       1    1  76790  1.3013056  0.07380115
5       1    1  87297 -1.4021137  1.62395596
6       1    2  12581  1.3062181 -1.03360620

With this example, if windowSize = 70000 and windowStep = 30000, the first two values of D are -0.236604 and 0.439462, which are the results of sum(dt$arg1[1:2])/sum(dt$arg2[1:2]) and sum(dt$arg1[2:5])/sum(dt$arg2[2:5]) respectively.

3 answers:

Answer 0: (score: 3)

Unless I have misunderstood something, the following may help.

Define a simple function that ignores the hierarchical grouping:

ff = function(salary, wSz, wSt, arg1, arg2) 
{
    froms = (wSt * (0:ceiling(max(salary) / wSt)))
    tos = froms + wSz
    Ds = mapply(function(from, to, salaries, args1, args2) {
                  inds = salaries > from & salaries < to
                  sum(args1[inds]) / sum(args2[inds])
                },          
                from = froms, to = tos, 
                MoreArgs = list(salaries = salary, args1 = arg1, args2 = arg2))
    list(from = froms, to = tos, D = Ds)                
}

Compute it on the groups with, for example, data.table:

library(data.table)
dt2 = as.data.table(dt)
ans = dt2[, ff(salary, 70000, 30000, arg1, arg2), by = c("country", "city")]
head(ans, 10)
#    country city  from     to          D
# 1:       1    1     0  70000 -0.2366040
# 2:       1    1 30000 100000  0.4394620
# 3:       1    1 60000 130000  0.2838260
# 4:       1    1 90000 160000        NaN
# 5:       1    2     0  70000  1.8112196
# 6:       1    2 30000 100000  0.6134090
# 7:       1    2 60000 130000  0.5959344
# 8:       1    2 90000 160000        NaN
# 9:       1    3     0  70000  1.3216255
#10:       1    3 30000 100000  1.8812397
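The NaN rows in this output are the empty windows: sum() of a zero-length vector is 0, so D becomes 0/0. The question asks for NA instead. A minimal sketch of one way to convert the result afterwards, where nan_to_na is a hypothetical helper name and the D vector simply copies the first four values printed above:

```r
# Hypothetical helper: replace NaN (from 0/0 on empty windows) with NA.
nan_to_na <- function(x) {
  x[is.nan(x)] <- NA_real_
  x
}

# First four D values from the output above (country 1, city 1).
D <- c(-0.2366040, 0.4394620, 0.2838260, NaN)
D_clean <- nan_to_na(D)
print(D_clean)
```

Note that a non-empty window whose arg1 and arg2 sums are both exactly 0 would also produce NaN, so testing any(inds) inside ff is the stricter way to detect an empty window.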

That is, a faster equivalent of:

lapply(split(dt[-c(1, 2)], interaction(dt$country, dt$city, drop = TRUE)),
       function(x) as.data.frame(ff(x$salary, 70000, 30000, x$arg1, x$arg2)))
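Since performance matters here, it may be worth noting that ff rescans every salary for every window. Because salary is sorted within each group, the same sums can be obtained from prefix sums and findInterval. The following is only a sketch under that sorting assumption: ff_cumsum is a hypothetical name, it keeps ff's strict bounds (from < salary < to), it returns NA for empty windows, and it relies on the left.open argument of findInterval (R 3.3.0 or later).

```r
# Hypothetical prefix-sum variant of ff(); assumes salary is sorted ascending.
ff_cumsum <- function(salary, wSz, wSt, arg1, arg2) {
  froms <- wSt * (0:ceiling(max(salary) / wSt))
  tos <- froms + wSz
  lo <- findInterval(froms, salary)                  # number of salaries <= from
  hi <- findInterval(tos, salary, left.open = TRUE)  # number of salaries <  to
  cs1 <- c(0, cumsum(arg1))                          # cs1[k + 1] = sum(arg1[1:k])
  cs2 <- c(0, cumsum(arg2))
  # Rows (lo + 1):hi fall strictly inside the window
  D <- (cs1[hi + 1] - cs1[lo + 1]) / (cs2[hi + 1] - cs2[lo + 1])
  D[hi <= lo] <- NA_real_                            # empty windows
  list(from = froms, to = tos, D = D)
}

# Toy group, made up for illustration
s  <- c(10, 20, 20, 35, 50)
a1 <- c(1, 2, 3, 4, 5)
a2 <- c(2, 1, 1, 2, 4)
res <- ff_cumsum(s, wSz = 20, wSt = 10, a1, a2)
print(as.data.frame(res))
```

It has the same argument order as ff, so it could be tried in the same grouped call. Differences of cumulative sums can deviate from direct sums by floating-point rounding, so results should be compared with all.equal rather than identical.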

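Answer 1: (score: 1)

Without your expected result it is hard for me to tell whether my result is correct, but it should give you a head start. From a performance point of view, the data.table package is very fast, much faster than loops.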

set.seed(84)
country <- rep(1:3, c(30, 22, 51))
city <- c(rep(1:5, c(5,5,5,5,10)), rep(1:5, c(1,1,10,8,2)), rep(c(1,3,4,5), c(20, 7, 3, 21)))
tt <- paste0(city, country)
salary <- c()
for (i in unique(tt)) salary <- append(salary, sort(round(runif(sum(tt==i), 0,100000))))

arg1 <- rnorm(length(country), 1, 1)
arg2 <- rnorm(length(country), 1, 1)
dt <- data.frame(country = country, city = city, salary = salary, arg1 = arg1, arg2 = arg2)
head(dt)

# For data table
require(data.table)
# For rollapply
require(zoo)
setDT(dt)

WindowSize <- 10
WindowStep <- 3
# Note: width and by count rows within each group, not salary units
dt[, .(D = rollapply(arg1, width = WindowSize, FUN = sum, by = WindowStep) /
           rollapply(arg2, width = WindowSize, FUN = sum, by = WindowStep)),
   by = .(country, city)]

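You can achieve the second half of your goal by melting the data and writing a custom aggregation function, which can then be used to cast the data back together.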
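One caveat about the rollapply call in this answer: its width and by arguments count observations, whereas the question defines windows in salary units. The base-R sketch below illustrates what a row-count window computes. It is not zoo's implementation, and row_window_sums and the toy vectors are made up for illustration.

```r
# Hypothetical helper: sums over windows of `width` consecutive rows,
# advancing `step` rows at a time (complete windows only).
row_window_sums <- function(x, width, step) {
  if (length(x) < width) return(numeric(0))
  starts <- seq(1, length(x) - width + 1, by = step)
  vapply(starts, function(i) sum(x[i:(i + width - 1)]), numeric(1))
}

salary <- c(100, 200, 50000, 50100, 99000)
arg1   <- c(1, 2, 3, 4, 5)

# Every window holds exactly 2 rows, whether the salaries are 100 or 49800 apart.
row_sums <- row_window_sums(arg1, width = 2, step = 1)
print(row_sums)
```

A row-count window therefore only matches the salary-range windows of the question when salaries are evenly spaced, and it never yields an empty window.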

Answer 2: (score: 0)

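Table = NULL
StepNumber = 100
WindowSize = 1000
WindowRange = c(0, WindowSize)
WindowStep = 100
# Loop over unique values so each country/city pair is processed once
for(x in unique(dt$country)){
     #subset of data for that country
     CountrySubset = dt[dt$country == x,,drop=FALSE]
     for(y in unique(CountrySubset$city)){
        #subset of data for cities within the country
        CitySubset = CountrySubset[CountrySubset$city == y,,drop=FALSE]
        for(z in 0:(StepNumber - 1)){
            #window z starts at z*WindowStep, so the first window starts at 0
            WinRange = WindowRange + (z*WindowStep)
            #subset of salaries within the city that fall inside the window
            WindowData = subset(CitySubset, salary > WinRange[1] & salary < WinRange[2])
            CalcD = sum(WindowData$arg1)/sum(WindowData$arg2)
            Output = c(Country = x, City = y, WinStart = WinRange[1], WinEnd = WinRange[2], D = CalcD)
            Table = rbind(Table,Output)
        }
    }
}

With your example code this should work; it is just a series of nested loops that write to Table. The loops have to run over the unique country and city values: looping over the raw dt$country and CountrySubset$city columns repeats each group once per row and duplicates rows in Table. The only way I know to keep adding results to the table is rbind.

So if anyone can improve on that, it should be good.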

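WindowStep is the offset you want between the starts of successive windows.

StepNumber is how many steps you want to take; it is best to find out what the highest salary is and adjust it accordingly.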
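Two of the points above can be sketched in base R: deriving StepNumber from the highest salary, and collecting rows in a list that is bound once at the end rather than calling rbind on every iteration (which copies the growing table each time). The value of max_salary is made up for illustration, and the sketch only builds the window bounds, not D.

```r
WindowSize <- 1000
WindowStep <- 100
max_salary <- 2350   # hypothetical; in practice something like max(dt$salary)

# Enough steps for window starts 0, 100, ..., up to the last start <= max_salary
StepNumber <- floor(max_salary / WindowStep) + 1

# Pre-allocate a list, fill one element per window, bind once at the end
rows <- vector("list", StepNumber)
for (z in seq_len(StepNumber)) {
  start <- (z - 1) * WindowStep
  rows[[z]] <- data.frame(WinStart = start, WinEnd = start + WindowSize)
}
Table <- do.call(rbind, rows)
print(head(Table, 3))
```

The same pattern extends to the nested loops: append one small data.frame per window to the list, then call do.call(rbind, rows) after the outermost loop.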