Question

I have a dataset that shows how much was paid ("cenoz", cents per ounce) for each product category in a given week and store.

clear
set more off
input week  store   cenoz   category
        1      1      2         1
        1      1      4         2
        1      1      3         3
        1      2      5         1
        1      2      7         2
        1      2      8         3
        2      1      4         1
        2      1      1         2
        2      1      10        3
        2      2      3         1
        2      2      4         2
        2      2      7         3
        3      1      5         1
        3      1      3         2
        3      2      5         1
        3      2      4         2
end

I created a new variable, cenoz3, holding the average price paid for category 3 in a given week and store, and did the same for cenoz1 and cenoz2.

egen cenoz1 = mean(cenoz/ (category == 1)), by(week store) 
egen cenoz2 = mean(cenoz/ (category == 2)), by(week store) 
egen cenoz3 = mean(cenoz/ (category == 3)), by(week store)

It turns out that in week 3 neither store (1 or 2) sold category 3. As a result, missing values were generated.

week    store   cenoz   category    cenoz1  cenoz2  cenoz3
  1       1       2        1           2       4      3
  1       1       4        2           2       4      3
  1       1       3        3           2       4      3
  1       2       5        1           5       7      8
  1       2       7        2           5       7      8
  1       2       8        3           5       7      8
  2       1       4        1           4       1      10
  2       1       1        2           4       1      10
  2       1       10       3           4       1      10
  2       2       3        1           3       4      7
  2       2       4        2           3       4      7
  2       2       7        3           3       4      7
  3       1       5        1           5       3      .
  3       1       3        2           5       3      .
  3       2       5        1           5       4      .
  3       2       4        2           5       4      .

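I would like to replace the missing values for a given week with the values from the previous week in the matching store. That is: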

replace missing values for category 3 in week 3 in store 1 
           with values for category 3 in week 2 in store 1

and

replace missing values for category 3 in week 3 in store 2 
           with values for category 3 in week 2 in store 2

Can I use the replace command, or is it more complicated than that?

Something like:

replace cenoz1 = cenoz1[_n-1] if missing(cenoz1)

But I also need the store to match, not just the time variable week.

I found this code by Nicholas Cox at http://www.stata.com/support/faqs/data-management/replacing-missing-values/:

by id (time), sort: replace myvar = myvar[_n-1] if myvar >= .

Do you think

 by store (week), sort: replace cenoz1 = cenoz1[_n-1] if missing(cenoz1)

would be enough?

Update:

When I use the code

by store (week category), sort: replace cenoz3 = cenoz3[_n-1] if missing(cenoz3)

it seems to give the correct values:

week    store   cenoz   category    cenoz1  cenoz2  cenoz3
  1       1       2        1           2       4      3
  1       1       4        2           2       4      3
  1       1       3        3           2       4      3
  1       2       5        1           5       7      8
  1       2       7        2           5       7      8
  1       2       8        3           5       7      8
  2       1       4        1           4       1      10
  2       1       1        2           4       1      10
  2       1       10       3           4       1      10
  2       2       3        1           3       4      7
  2       2       4        2           3       4      7
  2       2       7        3           3       4      7
  3       1       5        1           5       3      10
  3       1       3        2           5       3      10
  3       2       5        1           5       4      7
  3       2       4        2           5       4      7

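Given that my dataset is very large, is there a way to double-check this code?

Also, how can I make this code less specific, so that it applies to any cenoz variable that turns out to have missing values (cenoz1, cenoz2, cenoz3, cenoz4...cenoz12)?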

Answer 1

If you want to use the previous information for the same store and the same category, that would be

by store category (week), sort: replace cenoz3 = cenoz3[_n-1] if missing(cenoz3)

A generalization could be

sort store category week 
forval j = 1/12 { 
    by store category: replace cenoz`j' = cenoz`j'[_n-1] if missing(cenoz`j') 
}

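However, this kind of carrying forward is a rather crude method of interpolation. Consider linear, cubic, cubic spline, or PCHIP interpolation instead. Use search to find Stata programs.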
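As a rough illustration of the linear option, here is a minimal sketch using the official ipolate command on the example data from the question. The variable name cenoz3_lin is made up for this example. Because the missing values in the example sit at the end of each series (week 3), the epolate option is assumed here so that they are extrapolated rather than left missing; whether extrapolation is sensible for your prices is a separate judgment.

```stata
* Sketch only: linear interpolation as an alternative to carrying forward.
clear
input week store cenoz category
1 1 2  1
1 1 4  2
1 1 3  3
1 2 5  1
1 2 7  2
1 2 8  3
2 1 4  1
2 1 1  2
2 1 10 3
2 2 3  1
2 2 4  2
2 2 7  3
3 1 5  1
3 1 3  2
3 2 5  1
3 2 4  2
end

* Average category 3 price per week and store (same trick as in the question).
egen cenoz3 = mean(cenoz / (category == 3)), by(week store)

* Interpolate cenoz3 on week within each store-category series.
* cenoz3_lin is a hypothetical name; epolate extends the line beyond
* the last observed week, which is what fills week 3 here.
bysort store category: ipolate cenoz3 week, generate(cenoz3_lin) epolate

list store category week cenoz3 cenoz3_lin, sepby(store category)
```

Observed values are copied unchanged into cenoz3_lin; only the missing ones are computed. For the cubic, spline, and PCHIP variants, a community-contributed command such as mipolate from SSC may be worth a look.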

Answer 2

A quick note on why your code

by store (category week), sort: replace cenoz3 = cenoz3[_n-1] if missing(cenoz3)

does not work in general.

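It works for the example dataset you provided, but a slight modification of the data can produce unexpected results. Consider the following example: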

clear all
set more off

input week  store   cenoz   category
        1      1      2         1
        1      1      4         2 /* 
        1      1      3         3 deleted observation */
        1      2      5         1
        1      2      7         2
        1      2      8         3
        2      1      4         1
        2      1      1         2
        2      1      10        3
        2      2      3         1
        2      2      4         2
        2      2      7         3
        3      1      5         1
        3      1      3         2
        3      1    999          3 // new observation
        3      2      5         1
        3      2      4         2
end 

egen cenoz1 = mean(cenoz/ (category == 1)), by(week store) 
egen cenoz2 = mean(cenoz/ (category == 2)), by(week store) 
egen cenoz3 = mean(cenoz/ (category == 3)), by(week store) 

order store category week
sort store category week
list, sepby(store category)

*----- method 1 (your code) -----

gen cenoz3x1 = cenoz3
by store (category week), sort: replace cenoz3x1 = cenoz3x1[_n-1] if missing(cenoz3x1) 

*----- method 2 (Nick's code) -----

gen cenoz3x2 = cenoz3
by store category (week), sort: replace cenoz3x2 = cenoz3x2[_n-1] if missing(cenoz3x2) 

list, sepby(store category)

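Method 1 assigns a value from the category 1 rows to a category 2 row: observation 4 of cenoz3x1 (store 1, category 2, week 1) is filled with 999, the week 3 value carried over from the last category 1 observation. Presumably that is not what you want. To avoid it, the groups should be defined by store and category together, not by store alone.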
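On the question of double-checking the result on a large dataset, one possible approach is to keep a copy of the original variable and let assert confirm what the fill did. The sketch below runs on the example data from the question; cenoz3_orig and gap are names invented for this illustration. It also flags fills where the previous observation is not actually the previous week, since [_n-1] refers to the previous row in the group, whatever week that row belongs to.

```stata
* Sketch only: auditing a carry-forward with assert.
clear
input week store cenoz category
1 1 2  1
1 1 4  2
1 1 3  3
1 2 5  1
1 2 7  2
1 2 8  3
2 1 4  1
2 1 1  2
2 1 10 3
2 2 3  1
2 2 4  2
2 2 7  3
3 1 5  1
3 1 3  2
3 2 5  1
3 2 4  2
end

egen cenoz3 = mean(cenoz / (category == 3)), by(week store)

* Keep the untouched values for comparison (hypothetical name).
gen cenoz3_orig = cenoz3

by store category (week), sort: replace cenoz3 = cenoz3[_n-1] if missing(cenoz3)

* Check 1: values that were observed must not have changed.
assert cenoz3 == cenoz3_orig if !missing(cenoz3_orig)

* Check 2: every filled value equals the previous row of its own group.
by store category (week): assert cenoz3 == cenoz3[_n-1] if missing(cenoz3_orig) & _n > 1

* Check 3: flag fills taken from a row that is not the previous week.
by store category (week): gen byte gap = missing(cenoz3_orig) & _n > 1 & week != week[_n-1] + 1
count if gap

* Check 4: anything still missing had no earlier value in its group.
count if missing(cenoz3)
```

assert stops with an error as soon as a condition fails, so a do-file that runs through to the end has passed checks 1 and 2. The two counts are informational: nonzero values point to rows worth listing and inspecting by hand.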

The best places to start reading are help and the manuals.

How do I replace a missing value in the current week with the value from the previous week?
