Question

First of all, here is a look at some of the variables in my dataset:

firm_id year    dyrstr  Lack    total_workers
2432    2002    1980        29
2432    2003    1980        23
2432    2005    1980    1   283
2432    2006    1980        56
2432    2007    1980        21
2433    2004    2001        42
2433    2006    2001    1   29
2433    2008    2001    1   100
2434    2002    2002        21
2434    2003    2002        55
2434    2004    2002        22
2434    2005    2002        24
2434    2006    2002        17
2434    2007    2002        40
2434    2008    2002        110
2434    2009    2002        158
2434    2010    2002        38
2435    2002    2002        80
2435    2003    2002        86
2435    2004    2002        877
2435    2005    2002        254
2435    2006    2002        71
2435    2007    2002        116
2435    2008    2002        118
2435    2009    2002        1165
2435    2010    2002        67
2436    2002    1992        24
2436    2003    1992        25
2436    2004    1992        22
2436    2005    1992        23
2436    2006    1992        21
2436    2007    1992        100
2436    2008    1992        73
2436    2009    1992        23
2436    2010    1992        40
2437    2002    2002        30
2437    2003    2002        31
2437    2004    2002        21
2437    2006    2002    1   56
2437    2007    2002        20

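Variables:

firm_id is the identifier of the firm
year is the year of the observation
dyrstr is the founding year of the firm
Lack equals 1 if the observation for the previous year is missing (e.g., in row 3 of the dataset, Lack equals 1 because the firm with ID 2432 has no observation for the year 2004)
total_workers is the number of workers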

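I would like to fill in the gaps, i.e. I would like to create new observations like the one shown below (only considering the firm with ID 2432):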

firm_id year    dyrstr  Lack    total_workers
2432    2002    1980        29
2432    2003    1980        23
*2432*  *2004* *1980*      *153*
2432    2005    1980    1   283
2432    2006    1980        56
2432    2007    1980        21

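The row in which I have put the values in asterisks is the newly created observation. This observation should be a copy of the previous observation, but with some modifications:

firm_id should stay the same as in the previous row
year should be the year of the previous row plus one
dyrstr should stay the same as in the previous row
Lack: it does not matter which value this variable takes
total_workers should equal 0.5 * (value of the previous observation + value of the following observation)
all other variables in my dataset (which I have not listed here) should stay the same as in the previous row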

I have read about the command expand, but help expand did not help me much. I hope one of you can help me!

Answer 1

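My suggestion hinges on using expand, which only needs to be told how many observations to add. I ignore your variable Lack, as Stata can work out for itself where the gaps are. My procedure for imputing total_workers is based on the built-in command ipolate and would therefore also work across gaps longer than 1 year, which do not arise in your example. The number of workers estimated this way is not necessarily an integer.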

For other interpolation procedures, check out cipolate, csipolate, and pchipolate, all accessible via ssc desc cipolate (or equivalent).

This kind of operation depends on getting the sort order exactly right, which I find non-trivial even with experience. So when working out code for similar problems, be prepared for false starts; pepper your trial code with list statements; and develop a good toy example dataset (as you provided here).

. clear

. input firm_id year dyrstr total_workers

       firm_id       year     dyrstr  total_w~s
  1. 2432 2002 1980 29
  2. 2432 2003 1980 23
  3. 2432 2005 1980 283
  4. 2432 2006 1980 56
  5. 2432 2007 1980 21
  6. 2433 2004 2001 42
  7. 2433 2006 2001 29
  8. 2433 2008 2001 100
  9. 2434 2002 2002 21
 10. 2434 2003 2002 55
 11. 2434 2004 2002 22
 12. 2434 2005 2002 24
 13. 2434 2006 2002 17
 14. 2434 2007 2002 40
 15. 2434 2008 2002 110
 16. 2434 2009 2002 158
 17. 2434 2010 2002 38
 18. 2435 2002 2002 80
 19. 2435 2003 2002 86
 20. 2435 2004 2002 877
 21. 2435 2005 2002 254
 22. 2435 2006 2002 71
 23. 2435 2007 2002 116
 24. 2435 2008 2002 118
 25. 2435 2009 2002 1165
 26. 2435 2010 2002 67
 27. 2436 2002 1992 24
 28. 2436 2003 1992 25
 29. 2436 2004 1992 22
 30. 2436 2005 1992 23
 31. 2436 2006 1992 21
 32. 2436 2007 1992 100
 33. 2436 2008 1992 73
 34. 2436 2009 1992 23
 35. 2436 2010 1992 40
 36. 2437 2002 2002 30
 37. 2437 2003 2002 31
 38. 2437 2004 2002 21
 39. 2437 2006 2002 56
 40. 2437 2007 2002 20
 41. end

. scalar N = _N

. bysort firm_id (year) : gen gap = year - year[_n-1]
(6 missing values generated)

. expand gap
(6 missing counts ignored; observations not deleted)
(4 observations created)

. gen orig = _n <= scalar(N)

. bysort firm_id (year) : replace total_workers = . if !orig
(4 real changes made, 4 to missing)

. bysort firm_id (year orig) : replace year = year[_n-1] + 1 if _n > 1 & year != year[_n-1] + 1
(4 real changes made)

. bysort firm_id (year): ipolate total_workers year , gen(total_workers2)

. list, sepby(firm_id)

     +------------------------------------------------------------+
     | firm_id   year   dyrstr   total_~s   gap   orig   total_~2 |
     |------------------------------------------------------------|
  1. |    2432   2002     1980         29     .      1         29 |
  2. |    2432   2003     1980         23     1      1         23 |
  3. |    2432   2004     1980          .     2      0        153 |
  4. |    2432   2005     1980        283     2      1        283 |
  5. |    2432   2006     1980         56     1      1         56 |
  6. |    2432   2007     1980         21     1      1         21 |
     |------------------------------------------------------------|
  7. |    2433   2004     2001         42     .      1         42 |
  8. |    2433   2005     2001          .     2      0       35.5 |
  9. |    2433   2006     2001         29     2      1         29 |
 10. |    2433   2007     2001          .     2      0       64.5 |
 11. |    2433   2008     2001        100     2      1        100 |
     |------------------------------------------------------------|
 12. |    2434   2002     2002         21     .      1         21 |
 13. |    2434   2003     2002         55     1      1         55 |
 14. |    2434   2004     2002         22     1      1         22 |
 15. |    2434   2005     2002         24     1      1         24 |
 16. |    2434   2006     2002         17     1      1         17 |
 17. |    2434   2007     2002         40     1      1         40 |
 18. |    2434   2008     2002        110     1      1        110 |
 19. |    2434   2009     2002        158     1      1        158 |
 20. |    2434   2010     2002         38     1      1         38 |
     |------------------------------------------------------------|
 21. |    2435   2002     2002         80     .      1         80 |
 22. |    2435   2003     2002         86     1      1         86 |
 23. |    2435   2004     2002        877     1      1        877 |
 24. |    2435   2005     2002        254     1      1        254 |
 25. |    2435   2006     2002         71     1      1         71 |
 26. |    2435   2007     2002        116     1      1        116 |
 27. |    2435   2008     2002        118     1      1        118 |
 28. |    2435   2009     2002       1165     1      1       1165 |
 29. |    2435   2010     2002         67     1      1         67 |
     |------------------------------------------------------------|
 30. |    2436   2002     1992         24     .      1         24 |
 31. |    2436   2003     1992         25     1      1         25 |
 32. |    2436   2004     1992         22     1      1         22 |
 33. |    2436   2005     1992         23     1      1         23 |
 34. |    2436   2006     1992         21     1      1         21 |
 35. |    2436   2007     1992        100     1      1        100 |
 36. |    2436   2008     1992         73     1      1         73 |
 37. |    2436   2009     1992         23     1      1         23 |
 38. |    2436   2010     1992         40     1      1         40 |
     |------------------------------------------------------------|
 39. |    2437   2002     2002         30     .      1         30 |
 40. |    2437   2003     2002         31     1      1         31 |
 41. |    2437   2004     2002         21     1      1         21 |
 42. |    2437   2005     2002          .     2      0       38.5 |
 43. |    2437   2006     2002         56     2      1         56 |
 44. |    2437   2007     2002         20     1      1         20 |
     +------------------------------------------------------------+

Answer 2

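The following works if, as in your example database, no firm has consecutive years missing. I also assume that the variable Lack is numeric and that the end result is an unbalanced panel (you were not specific about this in your question).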

* Expand database
expand 2 if Lack == 1, gen(x)
gsort firm_id year -x

* Substitution rules
replace year = year - 1 if x == 1
replace total_workers = (total_workers[_n-1] + total_workers[_n+1])/2 if x == 1

list, sepby(firm_id)

The expand line could be rewritten as expand Lack + 1, gen(x), but perhaps it is clearer as written.

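For the more general case in which consecutive years are missing, the following should work, assuming Lack specifies the number of consecutive years missing. For example, if a given firm jumps from 2006 to 2009, then the 2009 observation would have Lack = 2.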

* Expand database
expand Lack + 1, gen(x)
gsort firm_id year -x

* Substitution rules
replace year = year[_n-1] + 1 if x == 1

Now you only need to come up with an imputation rule for total_workers:

replace total_workers = ...
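
One possible rule, offered only as a sketch, is linear interpolation within each firm using ipolate, as in Answer 1. It assumes the expand, gsort and replace year steps above have already been run, so that x == 1 flags the newly created observations; total_workers_ip is a hypothetical variable name. The interpolated values are not necessarily integers.

```stata
* Sketch only: assumes x == 1 marks the observations created by expand.
* The copies still carry the value of the row they were copied from,
* so blank them out before interpolating.
replace total_workers = . if x == 1

* Linear interpolation within firm, using year as the x-variable.
bysort firm_id (year): ipolate total_workers year, generate(total_workers_ip)
```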

If Lack is a string, use real() to convert it to a number.
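
A minimal sketch of that conversion, assuming Lack is stored as a string such as "1" or "" (Lack_num is a hypothetical name). It would have to run before the expand step, which then uses Lack_num in place of Lack:

```stata
* real() returns missing for strings that do not parse as numbers (e.g. "").
generate Lack_num = real(Lack)

* expand ignores missing counts, so rows without a gap are left as they are.
expand Lack_num + 1, generate(x)
```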

Answer 3

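You already have answers, but I have had to do something similar before and always use the cross command, as shown below. Assume I already have your dataset in memory and continue with the following code: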

tempfile master year
save `master'
preserve
keep year
duplicates drop
save `year'

restore
//next two lines set me up to correct for different year ranges by firm; if year ranges were standard, this would be omitted
bys firm_id: egen minyear=min(year)
bys firm_id: egen maxyear=max(year)
keep firm_id minyear maxyear
duplicates drop
cross using `year'
merge m:1 firm_id year using `master', assert(1 3) nogen
drop if year<minyear | year>maxyear //this adjusts for years outside the earliest and latest years observed by firm; if year ranges standard, again omitted

From here, use the ipolate command in the spirit of @NickCox's answer.
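
As a rough sketch of that last step, assuming the cross, merge and drop commands above have been run, the rows added by cross have missing total_workers and dyrstr. The name total_workers_ip is hypothetical, and any other firm-level variables would need the same treatment as dyrstr.

```stata
* Sketch only: linear interpolation of total_workers within firm.
bysort firm_id (year): ipolate total_workers year, generate(total_workers_ip)

* dyrstr is constant within firm. Missing values sort last, so after
* sorting on dyrstr the first value in each firm is an observed one.
bysort firm_id (dyrstr): replace dyrstr = dyrstr[1] if missing(dyrstr)
```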

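I would be particularly interested in any pros and cons of using expand versus cross. (Apart from the fact that my use here depends on every year being observed in at least one record in order to build the crossed dataset, which could be avoided if I created the `year' tempfile differently.)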

Adding observations with specific values for variables
