Question

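I have a longitudinal dataset covering 18 time periods. For reasons not to be discussed here, this dataset is in wide shape rather than long. More precisely, time-varying variables have an alphabetic prefix that identifies the period to which they belong. For the sake of this question, consider a quantity of interest called pay. This variable is denoted apay in the first period, bpay in the second, and so on, up to rpay.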

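Importantly, different observations have missing values of this variable in different periods, in an unpredictable way. As a consequence, running a panel over the full set of periods would greatly reduce my number of observations. I would therefore like to know how many observations panels of different lengths would have. To evaluate this, I want to create variables that count, for each period and each number of consecutive periods ending in it, how many respondents have the variable observed throughout that sequence. For example, I want the variable b_count_2 to count how many observations have nonmissing pay in both the first and the second period. This can be achieved with something like this: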

local b_count_2 = 0
if apay != . & bpay != . {
        local b_count_2 = `b_count_2' + 1 // update for those with nonmissing pay in both periods
    }

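Now, since I want to do this automatically, it has to be in a loop. Moreover, there is a different number of sequences for each period. For example, for the third period there are two sequences (those with pay in periods 2 and 3, and those with pay in periods 1, 2 and 3). Thus, the number of variables to create is 1 + 2 + 3 + ... + 17 = 153. This variability has to be reflected in the loop. I propose some code below, but parts of it are wrong, or I am unsure about them, as highlighted in the comments.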

local list b c d e f g h i j k l m n o p q r               // periods over which iterate
foreach var of local list {                 // loop over periods
    local counter = 1                   // counter to update; reflects sequence length 
    while `counter' > 0 {                   // loop over sequence lengths
        gen _`var'_counter_`counter' = 0        // generate variable with counter
        if `var'pay != . {              // HERE IS PROBLEM 1. NEED TO MAKE THIS TO CHECK CONDITIONS WITH INCREASING NUMBER OF ELEMENTS 
            recode _`var'_counter_`counter' (0 = 1) // IM NOT SURE THIS IS HOW TO UPDATE SPECIFIC OBSERVATIONS.
            local counter = `counter' - 1       // update counter to look for a longer sequence in the next iteration
        }
    }
    local counter = `counter' + 1               // HERE IS PROBLEM 2. NEED TO STOP THIS LOOP! Otherwise counter goes to infinity.
}

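An example of what the code above would produce, if it were correct, follows. Consider a dataset of five observations over four periods (denoted a, b, c and d):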

Obs   a  b  c  d
1     1  1  .  1
2     1  1  .  .
3     .  .  1  1
4     .  1  1  .
5     1  1  1  1

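where 1 means that the value is observed in that period and . means that it is not. The goal of the code is to create 1 + 2 + 3 = 6 new variables, so that the new dataset is: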

Obs   a  b  c  d  b_count_2  c_count_2  c_count_3  d_count_2  d_count_3  d_count_4
1     1  1  .  1      1          0          0          0          0          0
2     1  1  .  .      1          0          0          0          0          0
3     .  .  1  1      0          0          0          1          0          0
4     .  1  1  .      0          1          0          0          0          0
5     1  1  1  1      1          1          1          1          1          1

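Now, why is this useful? Because I can then run a set of summarize commands to get a very nice description of the dataset. The code to print this information in one go would be something like this: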

local waves b c d e f g h i j k l m n o p q r                  // waves that end at least one sequence (a has none)
foreach w of local waves {                                     // loop over waves
    unab seqvars : `w'_count_*                                 // expand the wildcard (names as in the example table)
    foreach v of local seqvars {                               // loop over the sequence variables of this wave
        local len = substr("`v'", strlen("`w'_count_") + 1, .) // sequence length is the suffix of the variable name
        quietly sum `v' if `v' == 1                            // r(N) = number of individuals with value 1
        di as text "Wave `w' has a sequence of length `len' with " as result r(N) as text " observations." // print result
    }
}

For the example above, this produces the following output:

"Wave 'b' has a sequence of length 2 with 3 observations."
"Wave 'c' has a sequence of length 2 with 2 observations."
"Wave 'c' has a sequence of length 3 with 1 observations."
"Wave 'd' has a sequence of length 2 with 2 observations."
"Wave 'd' has a sequence of length 3 with 1 observations."
"Wave 'd' has a sequence of length 4 with 1 observations."

This gives me a nice summary of the trade-off I face between a wider panel and a longer panel.

Answer 1

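I echo @Dimitriy V. Masterov's comment about your use of this dataset shape. It can be convenient for some purposes, but for panel or longitudinal data such as yours, working with it in Stata is at best awkward and at worst impracticable.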

First, note in particular that

local b_count_2 = 0
if apay != . & bpay != . {
        local b_count_2 = `b_count_2' + 1 // update for those with nonmissing pay in both periods
}

will be evaluated only in terms of the first observation, i.e. as if you had coded

if apay[1] != . & bpay[1] != .

as documented here. Even if that is what you want, it is not usually a pattern for others to follow.
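To make the difference concrete, here is a minimal sketch that contrasts the if command with the if qualifier. It assumes two variables, apay and bpay, filled with the first two columns of the five-observation example from the question.

```stata
clear
input apay bpay
1 1
1 1
. .
. 1
1 1
end

* if command: the condition is checked once, using observation 1 only
local b_count_2 = 0
if apay != . & bpay != . {
    local b_count_2 = `b_count_2' + 1
}
display `b_count_2'              // 1, however many rows qualify

* if qualifier: the condition is checked for every observation
count if !missing(apay, bpay)
display r(N)                     // 3
```

missing() with several arguments returns 1 if any of them is missing, so !missing(apay, bpay) selects observations with both values present.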

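Second, and more generally, I have not tried to follow all the details of your code, because what I see is the creation of a large number of variables even for a tiny dataset such as the one in your sketch. For a series T periods long, you would create a triangular number [(T - 1) T] / 2 of new variables; in your example, (17 x 18) / 2 = 153. If someone had series 100 periods long, they would need 4950 new variables.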
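The arithmetic is quick to check. This short sketch evaluates (T - 1) T / 2 for the four-period example in the question, for the 18 periods of the real data, and for a 100-period series.

```stata
* number of sequence variables needed for T periods: (T - 1) * T / 2
foreach T in 4 18 100 {
    display as text "T = `T': " as result (`T' - 1) * `T' / 2
}
```

The three results are 6, 153 and 4950.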

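Note that, because of the first point just made, these new variables would under your strategy pertain only to individual variables such as pay and to individual panels. Presumably the restriction to individual panels could be fixed, but the main idea seems singularly ill-advised in many ways. In a nutshell, what strategy do you have for working with these hundreds or thousands of new variables, other than writing yet more nested loops?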

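Your main need seems to be to identify spells of non-missing and missing values. Easy machinery for this was developed long ago. General principles are discussed in this paper, and an implementation, tsspell, can be downloaded from SSC.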

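On Statalist, people are asked to provide workable examples with data as well as code. See this FAQ. That is entirely equivalent to the long-standing request here for an MCVE.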

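Despite all this advice, I would start by looking at the Stata command xtdescribe and the associated xt tools already available to you. These tools do require a long data shape, which reshape will provide for you.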

Answer 2

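If you insist on doing this with wide data, it is very inefficient to create extra variables just to count patterns of missing values. You can instead create a single string variable that holds the pattern for each observation. Then it is just a matter of extracting from this pattern variable what you are looking for (i.e. the pattern of consecutive periods up to the current wave). You can then loop over the lengths of the matching patterns and do the counts. Something like: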

* create some fake data
clear
set seed 12341
set obs 10
foreach pre in a b c d e f g {
    gen `pre'pay = runiform() if runiform() < .8
}

* build the pattern of missing data
gen pattern = ""
foreach pre in a b c d e f g {
    qui replace pattern = pattern + cond(mi(`pre'pay), " ", "`pre'")
}
list

qui foreach pre in b c d e f g {
    noi dis "{hline 80}" _n as res "Wave `pre'"

    // the longest substring without a space up to the wave
    gen temp = regexs(1) if regexm(pattern, "([^ ]+`pre')")
    noi tab temp

    // loop over the various substring lengths, from 2 to max length
    gen len = length(temp)
    sum len, meanonly
    local n = r(max)
    forvalues i = 2/`n' {
        count if length(temp) >= `i'
        noi dis as txt "length = " as res `i' as txt " obs = " as res r(N)
    }
    drop temp len
}
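
If the wide layout has to stay, a further variation is to skip new variables altogether and grow a condition one wave at a time inside a nested loop. What follows is only a sketch on the question's four-wave example. The variable names apay to dpay mirror the question, and the macro names (waves, cond, prev) are chosen purely for illustration.

```stata
clear
input apay bpay cpay dpay
1 1 . 1
1 1 . .
. . 1 1
. 1 1 .
1 1 1 1
end

local waves a b c d
local nw : word count `waves'
forvalues t = 2/`nw' {                       // wave in which the sequence ends
    local w : word `t' of `waves'
    local cond "`w'pay < ."                  // start with the current wave only
    forvalues k = 2/`t' {                    // sequence length
        local j = `t' - `k' + 1              // position of the wave to add
        local prev : word `j' of `waves'
        local cond "`cond' & `prev'pay < ."  // extend the condition backwards
        quietly count if `cond'
        display as text "Wave `w', length `k': " as result r(N) as text " observations"
    }
}
```

Each pass of the inner loop adds one earlier wave to the condition, which is how the increasing number of elements is handled. For the 18-wave data the only change should be the list in waves.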

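If you are willing to work in long form, here is how you could identify spells of contiguous data and loop to get the information you want (the data setup is exactly the same as above):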

* create some fake data in wide form
clear
set seed 12341
set obs 10
foreach pre in a b c d e f g {
    gen `pre'pay = runiform() if runiform() < .8
}

* reshape to long form
gen id = _n
reshape long @pay, i(id) j(wave) string

* identify spells of contiguous periods
egen wavegroup = group(wave), label 
tsset id wavegroup  
tsspell, cond(pay < .)
drop if mi(pay)

foreach pre in b c d e f g {
    dis "{hline 80}" _n as res "Wave `pre'"

    sum _seq if wave == "`pre'", meanonly
    local n = r(max)
    forvalues i = 2/`n' {
        qui count if _seq >= `i' & wave == "`pre'"
        dis as txt "length = " as res `i' as txt " obs = " as res r(N)
    }

}

Answer 3

Let me add another answer based on the example now added to the question.

Obs   a  b  c  d
1     1  1  .  1
2     1  1  .  .
3     .  .  1  1
4     .  1  1  .
5     1  1  1  1

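The aim of this answer is not to provide what the OP asks for, but to indicate how many simple tools are available for looking at patterns of non-missing and missing values, none of which entails creating large numbers of extra variables or writing intricate code based on nested loops for every new question. Most of these tools require a reshape long.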

. clear  

. input a b c d

             a          b          c          d
  1.  1 1 . 1
  2.  1 1 . .
  3.  . . 1 1
  4.  . 1 1 .
  5.  1 1 1 1
  6. end 

. rename (a b c d) (y1 y2 y3 y4) 

. gen id = _n 

. reshape long y, i(id) j(time) 
(note: j = 1 2 3 4)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        5   ->      20
Number of variables                   5   ->       3
j variable (4 values)                     ->   time
xij variables:
                           y1 y2 ... y4   ->   y
-----------------------------------------------------------------------------

. xtset id time 
       panel variable:  id (strongly balanced)
        time variable:  time, 1 to 4
                delta:  1 unit

. preserve 

. drop if missing(y) 
(7 observations deleted)

. xtdescribe 

      id:  1, 2, ..., 5                                      n =          5
    time:  1, 2, ..., 4                                      T =          4
           Delta(time) = 1 unit
           Span(time)  = 4 periods
           (id*time uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         2       2       2         2         3       4       4

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+---------
        1     20.00   20.00 |  ..11
        1     20.00   40.00 |  .11.
        1     20.00   60.00 |  11..
        1     20.00   80.00 |  11.1
        1     20.00  100.00 |  1111
 ---------------------------+---------
        5    100.00         |  XXXX

* ssc inst xtpatternvar 
. xtpatternvar, gen(pattern) 

* ssc inst groups 
. groups pattern

  +------------------------------------+
  | pattern   Freq.   Percent     % <= |
  |------------------------------------|
  |    ..11       2     15.38    15.38 |
  |    .11.       2     15.38    30.77 |
  |    11..       2     15.38    46.15 |
  |    11.1       3     23.08    69.23 |
  |    1111       4     30.77   100.00 |
  +------------------------------------+

. restore  

. egen nmissing = total(missing(y)), by(time)

. tabdisp time, c(nmissing) 

----------------------
     time |   nmissing
----------+-----------
        1 |          2
        2 |          1
        3 |          2
        4 |          2
----------------------
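
For completeness, the counts asked for in the question can also be read off the long layout with one running counter instead of one variable per sequence. This is a sketch under the same setup as above. The variable name run is an assumption made for illustration, and the recursion relies on replace working through observations in order.

```stata
clear
input a b c d
1 1 . 1
1 1 . .
. . 1 1
. 1 1 .
1 1 1 1
end
rename (a b c d) (y1 y2 y3 y4)
gen id = _n
reshape long y, i(id) j(time)

* run = length of the unbroken nonmissing sequence ending at this time
bysort id (time): gen run = !missing(y)
by id: replace run = run[_n-1] + 1 if !missing(y) & _n > 1

forvalues t = 2/4 {                  // time at which the sequence ends
    forvalues k = 2/`t' {            // sequence length
        quietly count if time == `t' & run >= `k'
        display as text "Time `t', length `k': " as result r(N) as text " observations"
    }
}
```

A sequence of length k ending at time t exists exactly when run is at least k at that time, so a single variable replaces all the b_count_2 to d_count_4 indicators.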
