Question

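I have a large panel dataset (about 123 million observations) that contains several pairs of series, for example amount_old and amount_new. The series amount_new extends further forward in time than amount_old, so I would like to extrapolate the missing values of amount_old using the growth rates computed from amount_new.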

Here is a small sample dataset:

clear

input str3 str_id year amount_old amount_new
       aaa   2000     1105.34      1568.2  
       aaa   2001   1122.6268   1571.8486  
       aaa   2002   1132.0478    1605.832  
       aaa   2003   1186.9295   1666.4644  
       aaa   2004   1187.2502   1714.0043  
       aaa   2005   1230.0004   1744.4136  
       aaa   2006   1252.9979   1821.2219  
       aaa   2007   1289.5164   1855.4785  
       aaa   2008   1351.6705   1864.0597  
       aaa   2009    1353.639   1877.5152  
       aaa   2010   1398.2009   1916.5298  
       aaa   2011           .   1921.5906  
       aaa   2012           .   2003.8804  
       aaa   2013           .   2051.6525  
       aaa   2014           .   2072.8235  
       bbb   2000   7964.3029     9043.68  
       bbb   2001   8062.8454   9319.9098  
       bbb   2002    8223.277   9415.5202  
       bbb   2003   8605.8333    9760.014  
       bbb   2004   8636.8787   10024.964  
       bbb   2005   8927.8641   10327.588  
       bbb   2006     9284.91   10408.275  
       bbb   2007           .   10693.495  
       bbb   2008           .   11141.559  
       bbb   2009           .   11367.394  
       bbb   2010           .   11671.628  
       bbb   2011           .   11994.248  
       ccc   1990    20593.59   31049.493  
       ccc   1991   20723.578   31364.674  
       ccc   1992   21119.377   32870.953  
       ccc   1993           .   33138.507  
       ccc   1994           .   33383.829  
       ccc   1995           .   33776.957  
       ccc   1996           .   33966.004  
       ccc   1997           .   34324.091  
       ccc   1998           .   35744.175  
end

After loading the data, I can do the extrapolation by looping over every observation:

encode str_id, gen(id)
xtset id year
gen amount_new_gr = amount_new / L.amount_new - 1
forv i = 1/`=_N' {
    if missing(amount_old[`i']) {
        replace amount_old = amount_old[`=`i'-1'] * (1 + amount_new_gr[`i']) in `i'
    }
}

But this is quite slow on a dataset this large, and I need to do it for about 45 pairs of series (series1_old, series1_new, series2_old, and so on).

Is there a way to do this in Stata 13 using lag operators or some other feature of panel datasets?

Answer 1

Assuming you really want to do this (statistically, it may not be your best option), try the alternative shown in the code below:

clear
set more off

*----- example data -----

input str3 str_id year amount_old amount_new
       aaa   2000     1105.34      1568.2  
       aaa   2001   1122.6268   1571.8486  
       aaa   2002   1132.0478    1605.832  
       aaa   2003   1186.9295   1666.4644  
       aaa   2004   1187.2502   1714.0043  
       aaa   2005   1230.0004   1744.4136  
       aaa   2006   1252.9979   1821.2219  
       aaa   2007   1289.5164   1855.4785  
       aaa   2008   1351.6705   1864.0597  
       aaa   2009    1353.639   1877.5152  
       aaa   2010   1398.2009   1916.5298  
       aaa   2011           .   1921.5906  
       aaa   2012           .   2003.8804  
       aaa   2013           .   2051.6525  
       aaa   2014           .   2072.8235  
       bbb   2000   7964.3029     9043.68  
       bbb   2001   8062.8454   9319.9098  
       bbb   2002    8223.277   9415.5202  
       bbb   2003   8605.8333    9760.014  
       bbb   2004   8636.8787   10024.964  
       bbb   2005   8927.8641   10327.588  
       bbb   2006     9284.91   10408.275  
       bbb   2007           .   10693.495  
       bbb   2008           .   11141.559  
       bbb   2009           .   11367.394  
       bbb   2010           .   11671.628  
       bbb   2011           .   11994.248  
       ccc   1990    20593.59   31049.493  
       ccc   1991   20723.578   31364.674  
       ccc   1992   21119.377   32870.953  
       ccc   1993           .   33138.507  
       ccc   1994           .   33383.829  
       ccc   1995           .   33776.957  
       ccc   1996           .   33966.004  
       ccc   1997           .   34324.091  
       ccc   1998           .   35744.175  
end

// create more observations
expand 60000

bysort str_id year : gen idpre = _n
egen id = group(idpre str_id)

order id
drop str_id idpre

// xtset the data
xtset id year

// clear timers
timer clear

*----- original -----

timer on 1

gen amount_new_gr = amount_new / L.amount_new - 1

clonevar amount_old2 = amount_old

quietly forv i = 1/`=_N' {
    if missing(amount_old2[`i']) {
        replace amount_old2 = amount_old2[`=`i'-1'] * (1 + amount_new_gr[`i']) in `i'
    }
}

timer off 1

*----- alternative -----

timer on 2

gen growth = amount_new / L.amount_new

clonevar amount_old3 = amount_old

quietly bysort id : replace amount_old3 = L.amount_old3 * growth ///
    if missing(amount_old3)

timer off 2

// results
timer list

The timer command lets us benchmark the two versions: your original (1) and the suggested alternative (2). Times are measured in seconds:

. timer list
   1:     36.82 /        1 =      36.8180
   2:      0.83 /        1 =       0.8260

On a dataset of roughly 2 million observations, the alternative is about 45 times faster (36.8 s versus 0.8 s).
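
As a quick sanity check, and assuming the script above has just been run so that amount_old2 and amount_old3 both exist, something like the following could confirm that the two versions agree. The tolerance is an arbitrary, deliberately loose choice, because both variables are stored as float and the two versions round slightly differently:

* compare the loop result with the one-line result
* assert stops with an error if any observation violates the condition
assert reldif(amount_old2, amount_old3) < 1e-4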

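The code is also simpler and easier to read. Note that I use the if qualifier rather than the if command; the two behave differently. There is no need to loop over observations, because Stata does that for us automatically.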
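
To illustrate the distinction between the two forms of if, here is a minimal standalone sketch. The variable x is made up for the demonstration, and clear drops whatever data is in memory, so run it in a fresh session:

clear
input x
1
.
3
end

* if qualifier: the condition is checked for every observation
count if missing(x)

* if command: the expression is evaluated only once;
* a bare variable name here refers to its value in the first observation, x[1]
if missing(x) {
    display "x[1] is missing"
}
else {
    display "x[1] is not missing"
}

The qualifier restricts a command to the observations that satisfy the condition, which is why a single replace ... if missing() can stand in for the explicit loop over observations.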

Also read help by; it is a basic and very important construct in Stata.
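
For the roughly 45 pairs of series mentioned in the question, the same one-line replace could be wrapped in a loop over the pairs rather than over the observations. This is only a sketch: the names series1_old/series1_new through series45_old/series45_new are assumed from the naming pattern in the question, and the data are assumed to be xtset id year already:

* hypothetical variable names: series1_old, series1_new, ..., series45_old, series45_new
forvalues k = 1/45 {
    * gross growth factor of the "new" series, held in a temporary variable
    tempvar growth
    quietly generate double `growth' = series`k'_new / L.series`k'_new

    * fill the missing values of the "old" series forward, panel by panel
    quietly bysort id : replace series`k'_old = L.series`k'_old * `growth' ///
        if missing(series`k'_old)

    drop `growth'
}

Note that this overwrites the series*_old variables in place; use clonevar first, as in the benchmark above, if the originals should be kept.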

How can I extrapolate a series using the growth rates of another series in panel data?
