Question

我正在处理超过200,000行的数据集。我试图根据某些条件用下一行或前一行中的值替换缺失值。下面的循环只运行一次，但只要指定变量缺少值，我希望它继续运行。数据如下所示：

ID  primary_ins primary_ins_collegecodebranch
36  GROSSMONT COLLEGE   120800
37  GROSSMONT COLLEGE   120800
38  GROSSMONT COLLEGE   120800
39  SAN DIEGO STATE UNIVERSITY  
40  SAN DIEGO STATE UNIVERSITY             
41  SAN DIEGO STATE UNIVERSITY  115100
42  DIEGO STATE UNIVERSITY  115100
43  GROSSMONT COLLEGE   120800
44  GROSSMONT COLLEGE   120800
45  FRESNO CITY COLLEGE 130700



gen primary_ins_collegecodebranch=collegecodebranch if primary_ins==college
    foreach x of varlist primary_ins_collegecodebranch{
        replace primary_ins_collegecodebranch=primary_ins_collegecodebranch[_n+1] if missing(primary_ins_collegecodebranch) & primary_ins==primary_ins[_n+1]
        replace primary_ins_collegecodebranch=primary_ins_collegecodebranch[_n-1] if missing(primary_ins_collegecodebranch) & primary_ins==primary_ins[_n-1]
    }

Answer 1

这还不太清楚。例如，您不需要在简单的层面上解释数据的基本结构（学生？课程等）。标识符是什么意思？他们是否提供信息？似乎您的数据片段中存在一些缺失，但是变量是字符串还是数字并不明确。因此，该示例未达到https://stackoverflow.com/help/mcve

foreach循环恰好是一个变量的循环，只执行一次。在这种情况下，它是多余的，特别是因为循环不包括对其本地宏x的引用。

毫无疑问，这里有一些很长的变量名称，但是它们使其他人难以遵循这一点。

使用本地宏定义，我将更简单地显示结构。

local z primary_ins_collegecodebranch
local y collegecodebranch 
gen `z' = `y' if primary_ins==college
replace `z' = `z'[_n+1] if missing(`z') & primary_ins==primary_ins[_n+1]
replace `z'= `z'[_n-1] if missing(`z') & primary_ins==primary_ins[_n-1]

正如你所说，但是使用Stata中使用的术语，问题是通过相邻值插值缺失值。

参见上一个帖子replace missing value based on linear prediction of nearby cells

我不太明白你的插值规则（“基于某些条件”是完全模糊的）但我会打赌不需要循环。查看mipolate（SSC）。你可能会追求它所谓的分组插值，但是需要识别组。

请参阅http://www.stata.com/support/faqs/data-management/replacing-missing-values/，了解您的两个replace语句无法对称运行的原因。

编辑也许你只是想要

如果primary_ins_collegecodebranch是字符串

local z primary_ins_collegecodebranch
bysort primary_ins (`z') : replace `z' = `z'[_N] if missing(`z')

如果primary_ins_collegecodebranch是数字

local z primary_ins_collegecodebranch
bysort primary_ins (`z') : replace `z' = `z'[1] if missing(`z')

这是mipolate（SSC）意义上的分组插值，除了mipolate对字符串不起作用，这就是第一个代码块可能相关的原因。

Stata foreach循环替换缺失值

1 个答案: