R - Conditional math expression in data.table

Date: 2016-08-12 21:54:44

Tags: r data.table conditional aggregation bigdata

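I am working with the FAO agricultural crop production database. Specifically, I have a data.table showing the rice production of every country in the world for each year from 1961 to 2014. Here is a simplified version of my data: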

d <- data.table(structure(list(Year = c(1961, 1962, 1963, 1964, 1965, 1966, 1967, 
1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 
1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 
1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 
2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 
2012, 2013, 2014, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 
1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 
1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 
1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 
2002, 2003, 2004, 2005, 2006), Country = c("Afghanistan", "Afghanistan", 
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", 
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", 
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", 
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", 
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", 
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", 
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", 
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", 
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", 
"Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", 
"Afghanistan", "Afghanistan", "Albania", "Albania", "Albania", 
"Albania", "Albania", "Albania", "Albania", "Albania", "Albania", 
"Albania", "Albania", "Albania", "Albania", "Albania", "Albania", 
"Albania", "Albania", "Albania", "Albania", "Albania", "Albania", 
"Albania", "Albania", "Albania", "Albania", "Albania", "Albania", 
"Albania", "Albania", "Albania", "Albania", "Albania", "Albania", 
"Albania", "Albania", "Albania", "Albania", "Albania", "Albania", 
"Albania", "Albania", "Albania", "Albania", "Albania", "Albania", 
"Albania"), totalprod = c(319000, 319000, 319000, 380000, 380000, 
337000, 396000, 402000, 407000, 366000, 350000, 4e+05, 420000, 
420000, 435000, 448000, 4e+05, 428000, 439000, 415000, 390000, 
364000, 350000, 334000, 317000, 336000, 324000, 343000, 320000, 
333000, 335000, 3e+05, 3e+05, 342000, 390000, 340000, 4e+05, 
450000, 280000, 260000, 242000, 388000, 434000, 463000, 485000, 
540000, 552000, 612000, 645000, 672000, 672000, 5e+05, 512094, 
537000, 4603, 5683, 9135, 8173, 10225, 10524, 11254, 12807, 14276, 
14924, 10760, 12000, 15168, 12000, 13500, 14000, 14400, 14800, 
15520, 13000, 13900, 11900, 13000, 12600, 12000, 11000, 10600, 
8830, 8450, 7000, 2283, 960, 585, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0)), .Names = c("Year", "Country", "totalprod"), class = "data.frame", row.names = c(NA, 
-100L)))

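I need to summarise this data.table in two different ways:

a) First, I need to compute the global rice production per year by summing the yearly production of every country. I managed to do that with: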

d.global.year <- d[, list(totalprodyear=sum(totalprod)), by=Year]
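b) Once I have the global rice production per year, I need to determine how much each country contributes in each year. This can be done by dividing each country's production in a given year by the global rice production for that year.

However, I am still trying to figure out how to do b).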

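EDITED: Expected output:

Take the global rice production in 1961 as an example: 323603

In this case, Afghanistan's contribution would be 319000/323603 = 0.986, and Albania's contribution would be 4603/323603 = 0.014

Any hints?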

2 answers:

Answer 0: (score: 2)

Here is a base R solution, broken down into the steps you asked for.

xd <- xtabs( totalprod~ Year+Country, data=d)

xd <- cbind(xd, yr.total=rowSums(xd) )

 str(xd)
 num [1:54, 1:3] 319000 319000 319000 380000 380000 337000 396000 402000 407000 366000 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:54] "1961" "1962" "1963" "1964" ...
  ..$ : chr [1:3] "Afghanistan" "Albania" "yr.total"
 xd[ , -3]/xd[,3]
     Afghanistan     Albania
1961   0.9857758 0.014224219
1962   0.9824968 0.017503226
1963   0.9721608 0.027839152
1964   0.9789450 0.021055045
 #snipped table
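
As a hedged aside, the division by the row total can also be sketched with base R's `prop.table()`, which divides each cell by its row sum when `margin = 1`. The `small` data frame below is an assumption made for illustration: it holds only the 1961-1963 values from the question, typed in by hand.

```r
# Illustrative subset of the question's data (1961-1963 only)
small <- data.frame(
  Year      = rep(c(1961, 1962, 1963), times = 2),
  Country   = rep(c("Afghanistan", "Albania"), each = 3),
  totalprod = c(319000, 319000, 319000, 4603, 5683, 9135)
)

# Year x Country table of production, as in the answer above
xs <- xtabs(totalprod ~ Year + Country, data = small)

# margin = 1 divides every cell by the sum of its row (the yearly total)
shares <- prop.table(xs, margin = 1)
print(round(shares, 4))
```

This skips the explicit `yr.total` column, so it only fits when the yearly totals themselves are not needed afterwards.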

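Answer 1: (score: 1)

You can do this in a two-stage groupby process: first group by Year and compute the total production for each year, then group by Year and Country and use the total computed in the previous stage to calculate the proportion contributed by each country: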

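# Stage 1 adds totalprodyear to d by reference; the trailing `[` lets the chain continue on the next line
sumDt <- d[, totalprodyear := sum(totalprod), by = Year][
  , .(totalprodyear, percentprod = sum(totalprod) / totalprodyear), by = .(Year, Country)]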
sumDt[, head(.SD, 3), by = Country]
#        Country Year totalprodyear percentprod
# 1: Afghanistan 1961        323603  0.98577578
# 2: Afghanistan 1962        324683  0.98249677
# 3: Afghanistan 1963        328135  0.97216085
# 4:     Albania 1961        323603  0.01422422
# 5:     Albania 1962        324683  0.01750323
# 6:     Albania 1963        328135  0.02783915
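
If each Year/Country pair appears only once, as in the sample data, the two stages can probably be collapsed into a single grouped assignment. The sketch below is an assumption-laden illustration rather than part of the answer: `small` is a hand-typed subset of the question's data (1961-1963 only) and `share` is a hypothetical column name.

```r
library(data.table)

# Illustrative subset of the question's data (1961-1963 only)
small <- data.table(
  Year      = rep(c(1961, 1962, 1963), times = 2),
  Country   = rep(c("Afghanistan", "Albania"), each = 3),
  totalprod = c(319000, 319000, 319000, 4603, 5683, 9135)
)

# Within each Year group, divide every row's production by the group total.
# := adds the `share` column to `small` by reference.
small[, share := totalprod / sum(totalprod), by = Year]
print(small)
```

Because `:=` modifies the table in place, working on `copy(d)` is the safer choice when the original table must stay unchanged.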