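I have a data file (.csv) in which each observation is one of 333 districts. Each district has an ID, such as 1101, 1102, and so on. I also have a second data file (.csv) in which each observation is one of 112,975 towns, including population data. The town data has a district_ID field, and each district contains roughly 300 towns. So there is one district with district_ID == 1101 and roughly 300 towns with district_ID == 1101.
I want to create a district-level population variable in my district dataset. That means matching the many town observations to each single district observation and summing the town-level population.
Thanks!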
Answer 0 (score: 7)
A data.table solution:
#some example data
set.seed(42)
districts <- data.frame(district_ID=1:10,whatever=rnorm(10))
towns <- data.frame(town=1:100,district_ID=rep(1:10,each=10),
population=rpois(100,sample(c(1e3,1e4,1e5))))
library(data.table)
districts <- data.table(districts,key="district_ID")
towns <- data.table(towns,key="district_ID")
#calculate district population
temp <- towns[,list(district_pop=sum(population)),by=district_ID]
#merge result with districts data.table
districts <- merge(districts,temp)
# district_ID whatever district_pop
# 1: 1 1.37095845 434886
# 2: 2 -0.56469817 334084
# 3: 3 0.36312841 342241
# 4: 4 0.63286260 433224
# 5: 5 0.40426832 334039
# 6: 6 -0.10612452 342810
# 7: 7 1.51152200 433362
# 8: 8 -0.09465904 333810
# 9: 9 2.01842371 342035
# 10: 10 -0.06271410 432302
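One thing worth checking with the merge above: by default `merge` keeps only districts that appear in both tables, so a district without any town rows would silently disappear. The following is a minimal sketch, assuming you want to keep such districts. The toy tables and the decision to treat a missing total as 0 are illustrative assumptions, not part of the original answer.

```r
library(data.table)

# Toy data (hypothetical): district 1103 has no towns
districts <- data.table(district_ID = c(1101L, 1102L, 1103L), key = "district_ID")
towns <- data.table(
  district_ID = c(1101L, 1101L, 1102L),
  population  = c(100L, 250L, 40L),
  key = "district_ID"
)

# Sum town populations within each district
temp <- towns[, list(district_pop = sum(population)), by = district_ID]

# all.x = TRUE keeps every district; unmatched districts get NA
result <- merge(districts, temp, by = "district_ID", all.x = TRUE)

# Optional assumption: a district with no towns has population 0
result[is.na(district_pop), district_pop := 0L]
print(result)
```

Whether 0 or NA is the right value for a district with no towns depends on whether the town file is known to be complete.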
Answer 1 (score: 4)
Edit: benchmarked with a larger dataset.
Use the tapply function to compute the population of each district:
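#index the tapply result by name (as.character), not by position,
#so that IDs such as 1101 are matched correctly
districtdata$population<-
as.vector(tapply(towndata$population,towndata$district_ID,sum)[as.character(districtdata$district_ID)])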
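`tapply` returns a vector whose names are the district IDs, so with real IDs such as 1101 the lookup has to go by name rather than by position. Here is a small sketch with made-up numbers (the data frames are illustrative, not the asker's data) that shows the difference:

```r
# Toy data (hypothetical values); IDs are not 1, 2, 3, ...
districtdata <- data.frame(district_ID = c(1102, 1101, 1103))
towndata <- data.frame(
  district_ID = c(1101, 1101, 1102, 1103, 1103),
  population  = c(100, 250, 40, 10, 5)
)

# Named totals, one element per district ID
totals <- tapply(towndata$population, towndata$district_ID, sum)

# Numeric indexing is positional: there is no element number 1102, so this is all NA
by_position <- totals[districtdata$district_ID]

# Character indexing matches on the names, whatever the row order of districtdata
districtdata$population <- as.vector(totals[as.character(districtdata$district_ID)])
print(districtdata)
```

With IDs that happen to run 1, 2, 3, ... both forms give the same answer, which is why the positional version appears to work on synthetic data.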
Some benchmarks, just for fun:
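fn1<-function(districts,towns) #tapply approach
{
#look up totals by name so the result does not depend on IDs being 1,2,3,...
districts$population<-
as.vector(tapply(towns$population,towns$district_ID,sum)[as.character(districts$district_ID)])
districts
}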
fn2<-function(districts,towns) #Roland's data.table approach:
{
districts <- data.table(districts,key="district_ID")
towns <- data.table(towns,key="district_ID")
temp<-towns[,list(district_pop=sum(population)),by=district_ID]
merge(districts,temp)
}
library(data.table)
library(microbenchmark)
set.seed(42)
districts <- data.frame(district_ID=1:300,whatever=rnorm(300))
#300 districts x 300 towns each = 90000 rows; every column must have this length
towns <- data.frame(town=1:90000,district_ID=rep(1:300,each=300),
population=rpois(90000,sample(c(1e3,1e4,1e5))))
microbenchmark(fn1(districts,towns),fn2(districts,towns))
Unit: milliseconds
expr min lq median uq max neval
fn1(districts, towns) 215.29266 231.47103 243.72353 265.28280 355.43895 100
fn2(districts, towns) 20.03636 27.51046 36.11116 58.56448 88.70766 100
Answer 2 (score: 1)
How about:
aggregate(population ~ district_ID, towns, sum)
(based on Roland's synthetic data)
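Note that `aggregate` returns a new data frame with one row per district rather than adding a column to `districts`. To get the variable into the district dataset, the result can be merged back in. A minimal base R sketch on made-up data (the values and the column name `district_pop` are illustrative assumptions):

```r
# Toy data (hypothetical values)
districts <- data.frame(district_ID = c(1101, 1102), whatever = c(0.5, -1.2))
towns <- data.frame(
  district_ID = c(1101, 1101, 1102),
  population  = c(100, 250, 40)
)

# One row per district with the summed town population
district_pop <- aggregate(population ~ district_ID, data = towns, FUN = sum)
names(district_pop)[names(district_pop) == "population"] <- "district_pop"

# Left join so that every district row is kept
districts <- merge(districts, district_pop, by = "district_ID", all.x = TRUE)
print(districts)
```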