Stata query: need help creating a new variable that depends on data in different rows of the same household

Asked: 2015-04-09 08:33:38

Tags: stata

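I want to create a new column in my cross-sectional survey dataset that holds the education of each woman's husband. I have IDs for the household (hid) and the individual (HL1), plus the following information:

  • MA1 == whether the woman is married (observed only for women)
  • MA2 == husband's age (observed only for married women)
  • HL4 == sex (observed for everyone)
  • HL6 == age (observed for everyone)
  • ED4A == highest level of education (observed for everyone)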

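Essentially, I want to write code that does the following:

  • First check whether the wife is currently married (MA1)
  • If so, look at the husband's age (MA2)
  • Then match the husband's age (MA2) against the ages of the males in the household (HL6)
  • Then look up that male's education (ED4A) and put it in a new column, on the same row as the woman.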

I tried this, but it didn't work: bysort hid (HL6) : gen husb_educ = ED4A[MA2]
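One likely reason the attempt fails: an explicit subscript such as ED4A[MA2] refers to an observation number within the by-group, not to the person whose age equals MA2. The sketch below runs the attempt on the numeric version of the sample household from this question; the variable name husb_educ_try is only for illustration.

```stata
clear
input HL1 MA1 MA2 hid HL4 HL6 ED4A
1 .  . 106 1 57 4
2 .  . 106 2 53 2
3 .  . 106 1 30 6
4 3  . 106 2 24 5
5 .  . 106 1 22 4
6 .  . 106 1 17 3
7 .  . 106 2 10 1
8 1 22 106 2 23 4
9 .  . 106 2  0 .
end

* ED4A[MA2] asks for observation number MA2 within the household.
* For the married woman MA2 is 22, but the household has only 9 rows,
* so the subscript is out of range and the result is missing.
* Where MA2 is missing, the subscript is missing and so is the result.
bysort hid (HL6) : gen husb_educ_try = ED4A[MA2]

list HL1 MA1 MA2 HL6 ED4A husb_educ_try, sepby(hid)
```

So the subscript would have to be the position of the husband's row in the household, which is not known in advance; the age has to be matched against HL6 instead.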

Here is a sample from the dataset:

+-----+----------+-----+-----+--------+-----+----------+
| HL1 |   MA1    | MA2 | hid |  HL4   | HL6 |   ED4A   |
+-----+----------+-----+-----+--------+-----+----------+
|   1 |          |     | 106 | Male   |  57 | Diploma  |
|   2 |          |     | 106 | Female |  53 | Intermed |
|   3 |          |     | 106 | Male   |  30 | Higher S |
|   4 | No, not  |     | 106 | Female |  24 | Bachelor |
|   5 |          |     | 106 | Male   |  22 | Diploma  |
|   6 |          |     | 106 | Male   |  17 | Secondar |
|   7 |          |     | 106 | Female |  10 | Primary  |
|   8 | Yes, cur |  22 | 106 | Female |  23 | Diploma  |
|   9 |          |     | 106 | Female |   0 |          |
+-----+----------+-----+-----+--------+-----+----------+

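So in this example I want a new column giving the husband's education, with Diploma as its value in row 8 (because this woman's husband is 22 years old, and the 22-year-old male in the household has a Diploma).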

The same sample, without value labels:

+-----+-----+-----+-----+-----+-----+------+
| HL1 | MA1 | MA2 | hid | HL4 | HL6 | ED4A |
+-----+-----+-----+-----+-----+-----+------+
|   1 |     |     | 106 |   1 |  57 |    4 |
|   2 |     |     | 106 |   2 |  53 |    2 |
|   3 |     |     | 106 |   1 |  30 |    6 |
|   4 |   3 |     | 106 |   2 |  24 |    5 |
|   5 |     |     | 106 |   1 |  22 |    4 |
|   6 |     |     | 106 |   1 |  17 |    3 |
|   7 |     |     | 106 |   2 |  10 |    1 |
|   8 |   1 |  22 | 106 |   2 |  23 |    4 |
|   9 |     |     | 106 |   2 |   0 |      |
+-----+-----+-----+-----+-----+-----+------+

A particularly large household:

input HL1 MA1 MA2 hid HL4 HL6 ED4A
1   .   .   365809  1   33  1
2   1   33  365809  2   26  1
1   .   .   365810  1   58  1
2   .   .   365810  2   54  .
3   .   .   365810  1   23  3
4   .   .   365810  1   23  2
5   .   .   365810  1   18  3
6   .   .   365810  1   15  2
7   .   .   365810  2   12  2
8   .   .   365810  1   33  3
9   1   dk  365810  2   31  1
10  .   .   365810  2   13  2
11  .   .   365810  2   11  1
12  .   .   365810  1   9   1
13  .   .   365810  1   6   1
14  .   .   365810  2   3   .
15  .   .   365810  1   2   .
16  .   .   365810  1   33  3
17  1   33  365810  2   30  1
18  .   .   365810  1   8   1
19  .   .   365810  2   6   1
20  .   .   365810  2   5   .
21  .   .   365810  1   1   .
22  .   .   365810  1   32  4
23  1   32  365810  2   30  1
24  .   .   365810  1   5   .
25  .   .   365810  2   3   .
26  .   .   365810  1   2   .
27  .   .   365810  1   30  4
28  1   30  365810  2   28  1
29  .   .   365810  2   2   .
30  .   .   365810  1   0   .
31  .   .   365810  1   27  2
32  1   27  365810  2   27  1
33  .   .   365810  2   2   .
34  .   .   365810  2   0   .
end

2 answers:

Answer 0: (score: 0)

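Since you have already outlined the steps needed to do what you want, writing a simple script should not be a problem. In my experience, the syntax is easier to learn if you write and run each step separately (and check after each step what happened, whether any errors were introduced, and so on). Once you have the hang of it, you can condense the code into fewer lines. Something like this should work (trying to follow the steps in your question):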

* Step 1: is the wife currently married?
* Not strictly necessary, as only married women have MA2, but the next step restricts to married women anyway.

* Generate a husband's-age variable and spread it to the whole household (a new variable keeps the original MA2 untouched).
* Codes follow the unlabeled sample in the question: MA1 == 1 is currently married, HL4 == 2 is female, HL4 == 1 is male.
gen husband_age = MA2 if MA1 == 1 & HL4 == 2
bys hid: egen husband_age_hid = max(husband_age)

* Mark which individual is the husband (assuming this is what was meant by pairing the husband's age with the age of a male in the household).
* Restricting to males keeps a woman of the same age from being marked.
gen husband = 0
bys hid: replace husband = 1 if HL4 == 1 & HL6 == husband_age_hid

* Copy the husband's education to the whole household.
gen husband_ED4A = ED4A if husband == 1
bys hid: egen husb_educ = max(husband_ED4A)

* Clean up, if necessary (husb_educ is not matched by husband* and is kept).
drop husband*

It might be better to use tempvars rather than generating new variables in the first step, but I thought these variables might be useful later.
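A minimal sketch of the tempvar variant mentioned above, run on the numeric sample household from the question. It assumes at most one married woman per household and at most one male of the matching age, and it writes the result only on the wife's row; the temporary names hage and heduc are chosen here for illustration.

```stata
clear
input HL1 MA1 MA2 hid HL4 HL6 ED4A
1 .  . 106 1 57 4
2 .  . 106 2 53 2
3 .  . 106 1 30 6
4 3  . 106 2 24 5
5 .  . 106 1 22 4
6 .  . 106 1 17 3
7 .  . 106 2 10 1
8 1 22 106 2 23 4
9 .  . 106 2  0 .
end

* Temporary variables are dropped automatically when the do-file ends.
* Run this block in one go, because the local macros do not survive
* between separate interactive runs.
tempvar hage heduc

* Husband's reported age, copied to every row of the household.
bysort hid : egen `hage' = max(cond(MA1 == 1 & HL4 == 2, MA2, .))

* Education of the male whose age equals that reported age.
by hid : egen `heduc' = max(cond(HL4 == 1 & HL6 == `hage', ED4A, .))

* Keep the value on the wife's row only.
gen husb_educ = `heduc' if MA1 == 1 & HL4 == 2

list HL1 MA1 MA2 HL4 HL6 ED4A husb_educ, sepby(hid)
```

With several married women in one household, max() would pick only the largest reported age, so this sketch is not a substitute for looping over the wives.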

Answer 1: (score: 0)

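Here is a start. The code loops over the different married women in each household, but it gives no result if two or more males match the husband's age.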

input  HL1  MA1  MA2  hid  HL4  HL6  ED4A 
  1    .   .     106    1   57     4 
  2    .   .     106    2   53     2 
  3    .   .     106    1   30     6 
  4    3   .     106    2   24     5 
  5    .   .     106    1   22     4 
  6    .   .     106    1   17     3 
  7    .   .     106    2   10     1 
  8    1  22     106    2   23     4 
  9    .   .     106    2    0     .    
 end 

bysort hid (MA1) : gen wid = _n if MA1 == 1 

su wid, meanonly 

local max = r(max) 

gen heducation = . 

quietly forval i = 1/`max' { 
    bysort hid : egen hage = min(cond(wid == `i', MA2, .)) 
    by hid : egen nmatches = total(HL4 == 1 & HL6 == hage) 
    by hid : egen work = min(cond(nmatches == 1 & HL6 == hage, ED4A, .)) 
    replace heducation = work if wid == `i' 
    drop hage nmatches work 
}

sort hid HL1 

list 

     +-----------------------------------------------------------+
     | HL1   MA1   MA2   hid   HL4   HL6   ED4A   wid   heduca~n |
     |-----------------------------------------------------------|
  1. |   1     .     .   106     1    57      4     .          . |
  2. |   2     .     .   106     2    53      2     .          . |
  3. |   3     .     .   106     1    30      6     .          . |
  4. |   4     3     .   106     2    24      5     .          . |
  5. |   5     .     .   106     1    22      4     .          . |
     |-----------------------------------------------------------|
  6. |   6     .     .   106     1    17      3     .          . |
  7. |   7     .     .   106     2    10      1     .          . |
  8. |   8     1    22   106     2    23      4     1          4 |
  9. |   9     .     .   106     2     0      .     .          . |
     +-----------------------------------------------------------+

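(UPDATE)

The extended example revealed a bug: one calculation was not restrictive enough and did not exclude females of the same age. (Incidentally, note that the new data are for two households, not one.)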

bysort hid (MA1) : gen wid = _n if MA1 == 1 

su wid, meanonly 

local max = r(max) 

gen heducation = . 

quietly forval i = 1/`max' { 
    bysort hid : egen hage`i' = min(cond(wid == `i', MA2, .)) 
    by hid : egen nmatches`i' = total(HL4 == 1 & HL6 == hage`i') 
    by hid : egen work`i' = min(cond(nmatches`i' == 1 & HL6 == hage`i' & HL4 == 1, ED4A, .)) 
    replace heducation = work`i' if wid == `i' 
}

sort hid wid HL1 

list hid wid MA2 HL6 ED4A heducation HL4 if inlist(HL6, 27, 30, 32, 33) | MA2 < ., sepby(hid) 

     +--------------------------------------------------+
     |    hid   wid   MA2   HL6   ED4A   heduca~n   HL4 |
     |--------------------------------------------------|
  1. | 365809     1    33    26      1          1     2 |
  2. | 365809     .     .    33      1          .     1 |
     |--------------------------------------------------|
  3. | 365810     1    27    27      1          2     2 |
  4. | 365810     2    33    30      1          .     2 |
  5. | 365810     3    32    30      1          4     2 |
  6. | 365810     4    30    28      1          4     2 |
 14. | 365810     .     .    33      3          .     1 |
 21. | 365810     .     .    33      3          .     1 |
 26. | 365810     .     .    32      4          .     1 |
 30. | 365810     .     .    30      4          .     1 |
 33. | 365810     .     .    27      2          .     1 |
     +--------------------------------------------------+
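In the listing, the wife whose husband is reported as 33 gets no education value because two males in her household are aged 33. A rough way to flag such cases is to store, for each wife, the number of males whose age matches. The sketch below reuses the loop idea on a small subset of the rows posted in the question (so the counts apply to this subset only); nmatch, hage and nmale are names made up for the illustration.

```stata
clear
input HL1 MA1 MA2 hid HL4 HL6 ED4A
 1 .  . 365809 1 33 1
 2 1 33 365809 2 26 1
 8 .  . 365810 1 33 3
16 .  . 365810 1 33 3
17 1 33 365810 2 30 1
22 .  . 365810 1 32 4
23 1 32 365810 2 30 1
end

* Number the married women within each household.
bysort hid (MA1 HL1) : gen wid = _n if MA1 == 1
su wid, meanonly
local max = r(max)

* For each wife, count the males in her household whose age equals MA2.
gen nmatch = .
quietly forval i = 1/`max' {
    bysort hid : egen hage = min(cond(wid == `i', MA2, .))
    by hid : egen nmale = total(HL4 == 1 & HL6 == hage)
    replace nmatch = nmale if wid == `i'
    drop hage nmale
}

* Wives with no match or with more than one candidate husband.
sort hid HL1
list hid HL1 MA2 nmatch if MA1 == 1 & nmatch != 1
```

A count of 0 also appears for a married woman whose MA2 is missing, so the listing covers both the ambiguous and the unmatched cases.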

For a more general discussion, see here, here and here.