How to backfill data

Date: 2015-09-14 15:36:18

Tags: stata

I have data that looks something like:

n   year    y
1   2000    
1   2000        
1   2001    
1   2002    6
1   2002    6
1   2003    9
2   2000    
2   2000    
2   2001        
2   2002    1
2   2002    9
2   2003    4
3   2000    
3   2001    
3   2002    3
3   2002    3
3   2003    5
3   2003    5
4   1999    
4   2000    
4   2001    
4   2002    
4   2002    4

How can I fill in the y value for all years before 2002 with the y value of the first observation of 2002, and do this by n?

For example, for n==2, the first y value of year==2002 is 1. Thus, I would want to fill in the three missing y values (two in 2000 and one in 2001) with 1. The new dataset would be:

n   year    y
1   2000    6
1   2000    6   
1   2001    6
1   2002    6
1   2002    6
1   2003    9
2   2000    1
2   2000    1
2   2001    1   
2   2002    1
2   2002    9
2   2003    4
3   2000    3
3   2001    3
3   2002    3
3   2002    3
3   2003    5
3   2003    5
4   1999    
4   2000    
4   2001    
4   2002    
4   2002    4

Note that the years before 2002 for n==4 did not get filled in because the first observation where year==2002 is blank.

I think that a solution may be along the lines of:

bysort n: gen temp = y[1] if year==2002
replace y = temp if year<2002
drop temp

But I am not sure about the first line.

2 answers:

Answer 0 (score: 1)

One (perhaps inelegant) solution:

sort n year, stable // [1]
gen y2 = y
by n year: gen _y = y2[1] if year == 2002 // [2]
egen _y2 = max(_y), by(n) // [3]
replace y2 = _y2 if year < 2002 // [4]
drop _*

li, sepby(n) noobs

yielding:

  +-------------------+
  | n   year   y   y2 |
  |-------------------|
  | 1   2000   .    6 |
  | 1   2000   .    6 |
  | 1   2001   .    6 |
  | 1   2002   6    6 |
  | 1   2002   6    6 |
  | 1   2003   9    9 |
  |-------------------|
  | 2   2000   .    1 |
  | 2   2000   .    1 |
  | 2   2001   .    1 |
  | 2   2002   1    1 |
  | 2   2002   9    9 |
  | 2   2003   4    4 |
  |-------------------|
  | 3   2000   .    3 |
  | 3   2001   .    3 |
  | 3   2002   3    3 |
  | 3   2002   3    3 |
  | 3   2003   5    5 |
  | 3   2003   5    5 |
  |-------------------|
  | 4   1999   .    . |
  | 4   2000   .    . |
  | 4   2001   .    . |
  | 4   2002   .    . |
  | 4   2002   4    4 |
  +-------------------+

Notes:
[1] The stable option keeps observations with the same n and year in their existing relative order, so the "first" y within each year stays first.
[2] Generates _y equal to the first y within year == 2002 only. Note that you need by n year: with by n alone, y[1] would be the first observation of the whole n group (here a year before 2002), even though the if qualifier assigns it only to observations where year == 2002.
[3] Fills in _y across the n group.
[4] Replaces y2 for years earlier than 2002.

Answer 1 (score: 1)

mipolate from SSC offers "backward" interpolation, as follows:

. ssc inst mipolate 

. bysort n: mipolate y year, gen(y2) backward

. l

     +-------------------+
     | n   year   y   y2 |
     |-------------------|
  1. | 1   2000   .    6 |
  2. | 1   2000   .    6 |
  3. | 1   2001   .    6 |
  4. | 1   2002   6    6 |
  5. | 1   2002   6    6 |
     |-------------------|
  6. | 1   2003   9    9 |
  7. | 2   2000   .    5 |
  8. | 2   2000   .    5 |
  9. | 2   2001   .    5 |
 10. | 2   2002   1    5 |
     |-------------------|
 11. | 2   2002   9    5 |
 12. | 2   2003   4    4 |
 13. | 3   2000   .    3 |
 14. | 3   2001   .    3 |
 15. | 3   2002   3    3 |
     |-------------------|
 16. | 3   2002   3    3 |
 17. | 3   2003   5    5 |
 18. | 3   2003   5    5 |
 19. | 4   1999   .    4 |
 20. | 4   2000   .    4 |
     |-------------------|
 21. | 4   2001   .    4 |
 22. | 4   2002   .    4 |
 23. | 4   2002   4    4 |
     +-------------------+

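I mention this because it may be of interest to others who come to this question. The key point is that multiple observations for the same identifier and year are averaged first (for n==2 the two 2002 values, 1 and 9, become 5), which is not what you want.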
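One way to see in advance whether that averaging would matter is to look for repeated n-year pairs before interpolating. The following is only a minimal sketch using the built-in duplicates command; the small dataset is typed in for illustration and the variable name dup is an assumption, not something from the original post.

```stata
* Illustrative data: the n == 2 block from the question.
clear
input n year y
2 2000 .
2 2000 .
2 2001 .
2 2002 1
2 2002 9
2 2003 4
end

* Count how often each n-year combination occurs.
duplicates report n year

* dup (hypothetical name) holds the number of other copies of each n-year pair;
* it is 0 for pairs that occur only once.
duplicates tag n year, generate(dup)

* Show only the repeated pairs so conflicting y values are easy to spot.
list n year y if dup > 0, sepby(year) noobs
```

Here the two 2002 rows carry different y values, which is exactly the case where averaging and "take the first" disagree.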

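Your particular version of the problem is highly fragile: somehow you know that the first of several values is the one to use, but nothing in the data you show us flags which one that is, or why. Sort the data on n year and which of the duplicates comes first could change! That is a dangerous situation for data management.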
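One possible way to reduce that fragility is to store the current order in a variable before any sorting, so that "first" is defined by the data rather than by accident. This is a sketch under the assumption that the order in memory at the start is the order you trust; obsno, first2002, fillval and y2 are illustrative names, and the data are a subset of the question's example.

```stata
* Illustrative data: the n == 2 and (shortened) n == 4 blocks.
clear
input n year y
2 2000 .
2 2001 .
2 2002 1
2 2002 9
2 2003 4
4 2001 .
4 2002 .
4 2002 4
end

* Record the trusted order explicitly (assumption: current order is correct).
gen long obsno = _n

* Within each n-year group, ordered by obsno, pick the first y of 2002.
bysort n year (obsno): gen first2002 = y[1] if year == 2002

* Copy that value to every observation of the same n.
* first2002 is constant where non-missing, so max() simply spreads it.
egen fillval = max(first2002), by(n)

* Fill only the years before 2002; leave the original y untouched.
gen y2 = y
replace y2 = fillval if year < 2002

drop first2002 fillval
sort n year obsno
list, sepby(n) noobs
```

Because obsno is part of the sort key, re-sorting the data later cannot change which 2002 observation counts as the first. For n==4 the first 2002 value is missing, so the earlier years stay missing, as requested in the question.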