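Double loop in R

Time: 2012-06-19 11:08:48

Tags: r loops data.table

I am new to R and have a question about loops.

My real dataset has 7000 observations from 80 countries, with 15 sectors and 6 types of organization, but here is a simplified example.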

country <- c("a","a","a","a","a","a","b","b","b","b","b","b",
             "c","c","c","c","c","c","d","d","d","d","d","d")
sector <- c("a","a","a","b","c","c","a","b","b","b","c","c",
            "b","b","b","b","c","c","a","a","b","b","c","c")
organization <-c("a","b","c","c","b","a","a","b","b","c","b","b",
                 "c","a","a","b","b","c","c","b","a","a","b","c")
budget <-c(2,4,3,5,9,7,5,4,3,6,1,2,4,5,6,1,5,3,4,2,3,5,4,6)
table <- data.frame(country, sector, organization, budget)

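What I want is:

  1. The number of organizations of each type in a given sector of a given country.
  2. The percentage of the sector's total budget given to each type of organization.

I first made a subset, selecting only the rows for country "a" and sector "a":

smalltable <- subset(table, (country == "a") & (sector == "a"))

Then, to answer my first question (how many organizations of each type there are in one sector of one country):

smalltable$count <- table(smalltable$organization)

Then I needed the share of the budget:

smalltable$percentage <- smalltable$budget / sum(smalltable$budget)

Then I used tapply:

N <- tapply(smalltable$count, smalltable$organization, FUN=sum)
financialshare <- tapply(smalltable$percentage, smalltable$organization, FUN=sum)

And finally I combined these:

total <- data.frame(smalltable$country, smalltable$sector, smalltable$organization, N, financialshare)
total

This is the small table I need!

But I need to do this for all 15 sectors and all 80 countries, so I need some kind of loop that runs over all sectors and repeats that for every country. I need to keep these tables as small as possible, gathering all the information about one country (all 15 sectors) in a single table. In addition, zero values should be dropped from the table to save space.

How do I proceed?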

3 answers:

Answer 0: (score: 3)

I'll give a data.table answer:

library(data.table)
my_table=data.table(country, sector, organization, budget)
by_org=my_table[, list(count=.N, budget=sum(budget)),
                  keyby=list(country, sector, organization)]
total_budgets=my_table[, list(total_budget=sum(budget)),
                  keyby=list(country, sector)]
joined_table= total_budgets[by_org]
joined_table[,percentage:=budget/total_budget]
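
Edit from Matthew: in v1.8.1, := can be used by group, so the join is not needed. This is simpler and faster, and the total_budget column is added on the right, which is a more natural position than the one the join produces in v1.8.0: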

DT = data.table(country, sector, organization, budget) 
ans = DT[, list(count=.N, budget=sum(budget)),
           keyby=list(country, sector, organization)] 
ans[, total_budget:=sum(budget), by=list(country,sector)]
ans[, percentage:=budget/total_budget]

Result (using v1.8.1):

    country sector organization count budget total_budget percentage
 1:       a      a            a     1      2            9  0.2222222
 2:       a      a            b     1      4            9  0.4444444
 3:       a      a            c     1      3            9  0.3333333
 4:       a      b            c     1      5            5  1.0000000
 5:       a      c            a     1      7           16  0.4375000
 6:       a      c            b     1      9           16  0.5625000
 7:       b      a            a     1      5            5  1.0000000
 8:       b      b            b     2      7           13  0.5384615
 9:       b      b            c     1      6           13  0.4615385
10:       b      c            b     2      3            3  1.0000000
11:       c      b            a     2     11           16  0.6875000
12:       c      b            b     1      1           16  0.0625000
13:       c      b            c     1      4           16  0.2500000
14:       c      c            b     1      5            8  0.6250000
15:       c      c            c     1      3            8  0.3750000
16:       d      a            b     1      2            6  0.3333333
17:       d      a            c     1      4            6  0.6666667
18:       d      b            a     2      8            8  1.0000000
19:       d      c            b     1      4           10  0.4000000
20:       d      c            c     1      6           10  0.6000000

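Two things to note here. First, your question is a bit vague about what exactly you want in terms of counts and sums, but hopefully my snippet is self-explanatory enough about the calculations I am doing.

Second, looping over a large number of observations is not idiomatic in R, because it tends to be slow. Most people who have programmed in R for a while tend to use vectorized operations, plyr, data.table, or other similar packages.

But for completeness, a loop is structured like this: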
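For readers without either package, the same vectorized idea can be sketched in base R. This is only an illustration under stated assumptions: it reuses the example vectors from the question, and the names dat, counts, sums, res and by_country are hypothetical, not part of any answer. aggregate() only returns combinations that actually occur, so no zero rows appear, and split() gives one table per country as the question asks.

```r
# Example data from the question (dat is a hypothetical name that avoids masking table())
country <- c("a","a","a","a","a","a","b","b","b","b","b","b",
             "c","c","c","c","c","c","d","d","d","d","d","d")
sector <- c("a","a","a","b","c","c","a","b","b","b","c","c",
            "b","b","b","b","c","c","a","a","b","b","c","c")
organization <- c("a","b","c","c","b","a","a","b","b","c","b","b",
                  "c","a","a","b","b","c","c","b","a","a","b","c")
budget <- c(2,4,3,5,9,7,5,4,3,6,1,2,4,5,6,1,5,3,4,2,3,5,4,6)
dat <- data.frame(country, sector, organization, budget, stringsAsFactors = FALSE)

# Number of rows per country/sector/organization
counts <- aggregate(budget ~ country + sector + organization, data = dat, FUN = length)
names(counts)[names(counts) == "budget"] <- "count"

# Budget per country/sector/organization
sums <- aggregate(budget ~ country + sector + organization, data = dat, FUN = sum)

# Combine on the shared key columns, then add the sector total and the share
res <- merge(counts, sums)
res$total_budget <- ave(res$budget, res$country, res$sector, FUN = sum)
res$percentage <- res$budget / res$total_budget
res <- res[order(res$country, res$sector, res$organization), ]

# One table per country, each covering all of its sectors
by_country <- split(res, res$country)
```

by_country[["a"]] then holds every sector of country "a" in a single data frame.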

for (item in list)
{
    ...
}

To iterate over indices, a common idiom is:

for (i in 1:length(object))
{
    ...
}
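Applied to the question, the literal double loop could look roughly like the sketch below. It is shown for completeness only, not as the recommended approach; dat, pieces and loop_result are hypothetical names, and the data are the example vectors from the question. When looping over indices, seq_along(object) is a safer choice than 1:length(object), because the latter yields c(1, 0) for an empty object.

```r
# Example data from the question
country <- c("a","a","a","a","a","a","b","b","b","b","b","b",
             "c","c","c","c","c","c","d","d","d","d","d","d")
sector <- c("a","a","a","b","c","c","a","b","b","b","c","c",
            "b","b","b","b","c","c","a","a","b","b","c","c")
organization <- c("a","b","c","c","b","a","a","b","b","c","b","b",
                  "c","a","a","b","b","c","c","b","a","a","b","c")
budget <- c(2,4,3,5,9,7,5,4,3,6,1,2,4,5,6,1,5,3,4,2,3,5,4,6)
dat <- data.frame(country, sector, organization, budget, stringsAsFactors = FALSE)

pieces <- list()
for (ct in unique(dat$country)) {        # outer loop: countries
  for (sc in unique(dat$sector)) {       # inner loop: sectors
    sub <- dat[dat$country == ct & dat$sector == sc, ]
    if (nrow(sub) == 0) next             # skip combinations with no data
    n <- tapply(sub$budget, sub$organization, length)  # rows per organization type
    b <- tapply(sub$budget, sub$organization, sum)     # budget per organization type
    pieces[[length(pieces) + 1]] <- data.frame(
      country = ct, sector = sc, organization = names(n),
      count = as.vector(n), budget = as.vector(b),
      percentage = as.vector(b) / sum(b),
      stringsAsFactors = FALSE
    )
  }
}

# Stack the per-group tables into one data frame
loop_result <- do.call(rbind, pieces)
```

Collecting the pieces in a list and binding them once at the end avoids growing a data frame inside the loop.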

Answer 1: (score: 2)

library(plyr)
ddply(table,.(country,sector), transform,count=as.vector(table(budget)),percentage=budget / sum(budget))

gives

   country sector organization budget count percentage
1        a      a            a      2     1  0.2222222
2        a      a            b      4     1  0.4444444
3        a      a            c      3     1  0.3333333
4        a      b            c      5     1  1.0000000
5        a      c            b      9     1  0.5625000
6        a      c            a      7     1  0.4375000
7        b      a            a      5     1  1.0000000
8        b      b            b      4     1  0.3076923
9        b      b            b      3     1  0.2307692
10       b      b            c      6     1  0.4615385
11       b      c            b      1     1  0.3333333
12       b      c            b      2     1  0.6666667
13       c      b            c      4     1  0.2500000
14       c      b            a      5     1  0.3125000
15       c      b            a      6     1  0.3750000
16       c      b            b      1     1  0.0625000
17       c      c            b      5     1  0.6250000
18       c      c            c      3     1  0.3750000
19       d      a            c      4     1  0.6666667
20       d      a            b      2     1  0.3333333
21       d      b            a      3     1  0.3750000
22       d      b            a      5     1  0.6250000
23       d      c            b      4     1  0.4000000
24       d      c            c      6     1  0.6000000
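Note that in the output above count is 1 in every row, because table(budget) tabulates distinct budget values rather than organization types. If a row-level table with a per-organization count is wanted instead, one possible base R sketch uses ave(). The name dat is hypothetical, and the data are the example vectors from the question.

```r
# Example data from the question
country <- c("a","a","a","a","a","a","b","b","b","b","b","b",
             "c","c","c","c","c","c","d","d","d","d","d","d")
sector <- c("a","a","a","b","c","c","a","b","b","b","c","c",
            "b","b","b","b","c","c","a","a","b","b","c","c")
organization <- c("a","b","c","c","b","a","a","b","b","c","b","b",
                  "c","a","a","b","b","c","c","b","a","a","b","c")
budget <- c(2,4,3,5,9,7,5,4,3,6,1,2,4,5,6,1,5,3,4,2,3,5,4,6)
dat <- data.frame(country, sector, organization, budget, stringsAsFactors = FALSE)

# Rows sharing the same country, sector and organization type
dat$count <- ave(dat$budget, dat$country, dat$sector, dat$organization, FUN = length)

# Each row's share of its country/sector budget
dat$percentage <- dat$budget / ave(dat$budget, dat$country, dat$sector, FUN = sum)
```

This keeps all 24 rows; rows 8 and 9 (country "b", sector "b", organization "b") both get a count of 2.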

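Answer 2: (score: 1)

You are perfectly set up to use plyr. By that I mean you have a process that (almost) works on one subset and returns exactly what you want for that subset, and now you need to loop over all possible subsets. I rewrote your code to make it more compact and to handle the case where an organization type is missing from a subset.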

library("plyr")

ddply(table, .(country, sector), function(smalltable) {
  smalltable <- ddply(smalltable, .(organization), summarise, 
                      count=length(budget), budget=sum(budget))
  smalltable$percentage <- smalltable$budget / sum(smalltable$budget)
  smalltable
})

which gives

   country sector organization count budget percentage
1        a      a            a     1      2  0.2222222
2        a      a            b     1      4  0.4444444
3        a      a            c     1      3  0.3333333
4        a      b            c     1      5  1.0000000
5        a      c            a     1      7  0.4375000
6        a      c            b     1      9  0.5625000
7        b      a            a     1      5  1.0000000
8        b      b            b     2      7  0.5384615
9        b      b            c     1      6  0.4615385
10       b      c            b     2      3  1.0000000
11       c      b            a     2     11  0.6875000
12       c      b            b     1      1  0.0625000
13       c      b            c     1      4  0.2500000
14       c      c            b     1      5  0.6250000
15       c      c            c     1      3  0.3750000
16       d      a            b     1      2  0.3333333
17       d      a            c     1      4  0.6666667
18       d      b            a     2      8  1.0000000
19       d      c            b     1      4  0.4000000
20       d      c            c     1      6  0.6000000

Note that table is not a good name for a variable, because it is also the name of a base function.
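A small sketch of why the name is confusing rather than fatal: when R evaluates a call such as table(...), it looks for a function with that name and skips non-function objects, so the question's code still runs. The toy data and the name org_data are hypothetical.

```r
# A data frame that shares its name with base::table()
table <- data.frame(organization = c("a", "b", "a"))

# Still calls the base function, because the call lookup skips the data frame
counts <- table(table$organization)

# A distinct name reads more clearly and avoids the ambiguity
org_data <- data.frame(organization = c("a", "b", "a"))
counts2 <- base::table(org_data$organization)
```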