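I have a large panel dataset (about 123 million observations) containing several pairs of series, for example amount_old and amount_new. The series amount_new extends further forward in time than amount_old, so I want to extrapolate the values of amount_old using growth rates computed from amount_new.
Here is a small sample dataset: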
clear
input str3 str_id year amount_old amount_new
aaa 2000 1105.34 1568.2
aaa 2001 1122.6268 1571.8486
aaa 2002 1132.0478 1605.832
aaa 2003 1186.9295 1666.4644
aaa 2004 1187.2502 1714.0043
aaa 2005 1230.0004 1744.4136
aaa 2006 1252.9979 1821.2219
aaa 2007 1289.5164 1855.4785
aaa 2008 1351.6705 1864.0597
aaa 2009 1353.639 1877.5152
aaa 2010 1398.2009 1916.5298
aaa 2011 . 1921.5906
aaa 2012 . 2003.8804
aaa 2013 . 2051.6525
aaa 2014 . 2072.8235
bbb 2000 7964.3029 9043.68
bbb 2001 8062.8454 9319.9098
bbb 2002 8223.277 9415.5202
bbb 2003 8605.8333 9760.014
bbb 2004 8636.8787 10024.964
bbb 2005 8927.8641 10327.588
bbb 2006 9284.91 10408.275
bbb 2007 . 10693.495
bbb 2008 . 11141.559
bbb 2009 . 11367.394
bbb 2010 . 11671.628
bbb 2011 . 11994.248
ccc 1990 20593.59 31049.493
ccc 1991 20723.578 31364.674
ccc 1992 21119.377 32870.953
ccc 1993 . 33138.507
ccc 1994 . 33383.829
ccc 1995 . 33776.957
ccc 1996 . 33966.004
ccc 1997 . 34324.091
ccc 1998 . 35744.175
end
After loading the data, I can extrapolate by looping over every observation:
encode str_id, gen(id)
xtset id year
gen amount_new_gr = amount_new / L.amount_new - 1
forv i = 1/`=_N' {
if missing(amount_old[`i']) {
replace amount_old = amount_old[`=`i'-1'] * (1 + amount_new_gr[`i']) in `i'
}
}
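But this is quite slow on such a large dataset, and I need to do it for about 45 pairs of series (series1_old, series1_new, series2_old, and so on).
Is there a way to do this in Stata 13 using lag operators or other features of panel datasets?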
Answer 0 (score: 1)
Assuming you really want to do this (statistically, it may not be your best option), try the alternative shown in the code below:
clear
set more off
*----- example data -----
input str3 str_id year amount_old amount_new
aaa 2000 1105.34 1568.2
aaa 2001 1122.6268 1571.8486
aaa 2002 1132.0478 1605.832
aaa 2003 1186.9295 1666.4644
aaa 2004 1187.2502 1714.0043
aaa 2005 1230.0004 1744.4136
aaa 2006 1252.9979 1821.2219
aaa 2007 1289.5164 1855.4785
aaa 2008 1351.6705 1864.0597
aaa 2009 1353.639 1877.5152
aaa 2010 1398.2009 1916.5298
aaa 2011 . 1921.5906
aaa 2012 . 2003.8804
aaa 2013 . 2051.6525
aaa 2014 . 2072.8235
bbb 2000 7964.3029 9043.68
bbb 2001 8062.8454 9319.9098
bbb 2002 8223.277 9415.5202
bbb 2003 8605.8333 9760.014
bbb 2004 8636.8787 10024.964
bbb 2005 8927.8641 10327.588
bbb 2006 9284.91 10408.275
bbb 2007 . 10693.495
bbb 2008 . 11141.559
bbb 2009 . 11367.394
bbb 2010 . 11671.628
bbb 2011 . 11994.248
ccc 1990 20593.59 31049.493
ccc 1991 20723.578 31364.674
ccc 1992 21119.377 32870.953
ccc 1993 . 33138.507
ccc 1994 . 33383.829
ccc 1995 . 33776.957
ccc 1996 . 33966.004
ccc 1997 . 34324.091
ccc 1998 . 35744.175
end
// create more observations
expand 60000
bysort str_id year : gen idpre = _n
egen id = group(idpre str_id)
order id
drop str_id idpre
// xtset the data
xtset id year
// clear timers
timer clear
*----- original -----
timer on 1
gen amount_new_gr = amount_new / L.amount_new - 1
clonevar amount_old2 = amount_old
quietly forv i = 1/`=_N' {
if missing(amount_old2[`i']) {
replace amount_old2 = amount_old2[`=`i'-1'] * (1 + amount_new_gr[`i']) in `i'
}
}
timer off 1
*----- alternative -----
timer on 2
gen growth = amount_new / L.amount_new
clonevar amount_old3 = amount_old
quietly bysort id : replace amount_old3 = L.amount_old3 * growth ///
if missing(amount_old3)
timer off 2
// results
timer list
The timer command lets us benchmark the two versions: your original (1) and the suggested alternative (2). Times are measured in seconds:
. timer list
1: 36.82 / 1 = 36.8180
2: 0.83 / 1 = 0.8260
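With a dataset of about 2.2 million observations (37 rows expanded 60,000 times), the alternative is more than 40 times faster (0.83 versus 36.82 seconds).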
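The code is also simpler and easier to read. Note that I use the if qualifier, not the if command (see the difference). Stata applies the qualifier to every observation automatically, so there is no need to loop over observations.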
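The distinction between the if qualifier and the if command can be seen in a small sketch. The variable x and the local first_missing are made up for illustration and are not part of the original data. The if command evaluates its expression once (a bare variable name there refers to its value in the first observation), while the if qualifier is evaluated for each observation.

```stata
clear
input x
.
2
.
end

* if command: the expression is evaluated once; x here means x[1]
local first_missing = 0
if missing(x) {
    local first_missing = 1
}
display "first observation missing? `first_missing'"

* if qualifier: the condition is checked observation by observation
count if missing(x)
```

This is why the loop in the question needs an explicit subscript such as amount_old[`i'] inside its if command, whereas replace ... if missing(...) needs no loop at all.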
Also read help by, a basic and very important construct in Stata.
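For the roughly 45 pairs mentioned in the question, the same idea can be wrapped in a loop over variable names rather than over observations. The following is only a sketch under stated assumptions: the pairs are assumed to be named series1_old/series1_new, series2_old/series2_new, and so on (following the question's naming), the toy data are invented, and years are assumed to be consecutive within each id so that [_n-1] is the previous year. It uses explicit subscripts under by instead of the L. operator, which is a deliberate swap for illustration and not the method timed above.

```stata
clear
* toy data (invented for illustration): two ids, two pairs of series
input id year double(series1_old series1_new series2_old series2_new)
1 2000 100 200   10 50
1 2001 110 220   12 60
1 2002 .   242   .  66
1 2003 .   266.2 .  72.6
2 2000 50  80    5  20
2 2001 .   88    .  30
end

* loop over pairs (2 here; 45 in the question), not over observations
forvalues k = 1/2 {
    tempvar growth
    * gross growth of the new series relative to the previous year
    quietly bysort id (year): generate double `growth' = series`k'_new / series`k'_new[_n-1]
    * replace works through observations in order, so each filled value
    * feeds the next missing one within the same id
    quietly bysort id (year): replace series`k'_old = series`k'_old[_n-1] * `growth' if missing(series`k'_old)
    drop `growth'
}
list, sepby(id)
```

If a panel has gaps in year, the subscript form would silently chain across the gap, so in that case the L. operator with xtset data, as in the answer's code, is the safer choice.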