Question

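I need to write a nearest-neighbour algorithm from scratch in Stata, because my dataset does not allow me to use any of the available solutions (as far as I know).

To be concrete: I have a dataset with a structure similar to the following (the original has around 14k observations).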

input id value treatment match
1 0.14 0 .
2 0.32 0 .
3 0.465 1 2
4 0.878 1 2
5 0.912 1 2
6 0.001 1 1
end

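I want to generate a variable called match (already included in the example above). For each observation with treatment == 1, match should store the id of the observation with treatment == 0 whose value is closest to the value of the treated observation under consideration.

I am new to Stata programming, so I am not yet familiar with the syntax. My first shot is the following, but it does not produce any changes to the match variable. I am sure this is a beginner's problem, but I would appreciate some advice on how to get the code running.

Edit: I changed the code slightly and it now seems to work. Do you see any problems that might arise if I run it on a larger dataset?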

set more off
clear all

input id pscore treatment
1 0.14 0
2 0.32 0
3 0.465 1
4 0.878 1
5 0.912 1
6 0.001 1
end

gen match = .

forval i = 1/`= _N' {
    if treatment[`i'] == 1 {

        local dist 1

        forvalues j = 1/`= _N' {
            if (treatment[`j'] == 0) {
                local current_dist (pscore[`i'] - pscore[`j'])^2

                if `dist' > `current_dist' {
                    local dist `current_dist' // update smallest distance
                    replace match = id[`j'] in `i' // write match

                }

            }
        }
    }   

}

Answer 1

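Consider some simulated data: 1,000 observations, of which 200 are untreated (treat == 0) and the rest treated (treat == 1). The code included below is then considerably more efficient than the code originally posted. (As in your code, ties are not handled explicitly.)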

clear
set more off

*----- example data -----

set obs 1000
set seed 32956

gen id = _n
gen pscore = runiform()
gen treat = cond(_n <= 200, 0, 1)

*----- new method -----

timer clear
timer on 1

// get id of last non-treated and first treated
// (data is sorted by treat and ids are consecutive)
bysort treat (id): gen firsttreat = id[1]
local firstt = firsttreat[_N]
local lastnt = `firstt' - 1

// start loop
gen match = .
gen dif = .

quietly forvalues i = `firstt'/`=_N' {

    // compute distances
    replace dif = (pscore[`i'] - pscore)^2
    summarize dif in 1/`lastnt', meanonly

    // identify id of minimum-distance observation
    replace match = . in 1/`lastnt'
    replace match = id in 1/`lastnt' if dif == r(min)
    summarize match in 1/`lastnt', meanonly

    // save the minimum-distance id
    replace match = r(max) in `i'

}

// clean variable and drop
replace match = . in 1/`lastnt'
drop dif firsttreat

timer off 1

tempfile first
save `first'

*----- your method -----

drop match

timer on 2

gen match = .

quietly forval i = 1/`= _N' {
    if treat[`i'] == 1 {

        local dist 1

        forvalues j = 1/`= _N' {
            if (treat[`j'] == 0) {
                local current_dist (pscore[`i'] - pscore[`j'])^2

                if `dist' > `current_dist' {
                    local dist `current_dist' // update smallest distance
                    replace match = id[`j'] in `i' // write match

                }

            }
        }
    }   

}

timer off 2

tempfile second
save `second'

// check for equality of results
cf _all using `first'

// check times
timer list

The results after running it:

. timer list
   1:      0.19 /        1 =       0.1930
   2:     10.79 /        1 =      10.7900

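The difference is large, especially considering that this dataset has only 1,000 observations.

Interestingly, the original method improves as the number of untreated cases increases relative to the number of treated cases, but it never reaches the efficiency of the new method. For example, reverse the case counts so that there are now 800 untreated and 200 treated (change the data setup to gen treat = cond(_n <= 800, 0, 1)). The result is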

. timer list
   1:      0.07 /        1 =       0.0720
   2:      4.45 /        1 =       4.4470

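You can see that the new method also improves, and it is still much faster. In fact, the speed ratio stays roughly the same (about 56 to 1 in the first run, about 62 to 1 in the second).

Another approach is to use joinby or cross. The problem is that they temporarily expand the size of the dataset, often by a lot. In many cases they are not feasible because of Stata's hard limit on the number of observations (see help limits). You can find an example of joinby here: https://stackoverflow.com/a/19784222/2077064.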
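For small data, the cross idea can be sketched as follows. This is a minimal sketch on the six-observation example from the question, not the code from the linked answer; the names cpscore, dif, full and controls are made up for illustration. It pairs every treated observation with every untreated one, so the intermediate dataset has (number treated) x (number untreated) rows.

```stata
clear
input id pscore treatment
1 0.14 0
2 0.32 0
3 0.465 1
4 0.878 1
5 0.912 1
6 0.001 1
end

tempfile full controls
save `full'

// controls only, renamed so they do not clash with the treated after cross
keep if treatment == 0
rename (id pscore) (match cpscore)
keep match cpscore
save `controls'

// treated only, paired with every control
use `full', clear
keep if treatment == 1
cross using `controls'

// keep the closest control per treated observation;
// ties go to the control with the lowest id
gen double dif = (pscore - cpscore)^2
bysort id (dif match): keep if _n == 1
keep id match

// bring the matches back into the full data (controls get match == .)
merge 1:1 id using `full', nogenerate
sort id
order id pscore treatment match
list, noobs
```

With 14k observations the crossed dataset can run into tens of millions of rows, which is why the loop-based approach above may be the more practical choice.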

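Edit

With a large number of treated relative to untreated observations, your code suffers because you go through the body of the first loop many times (because of the first if). Moreover, each full pass through that body means going through another loop, which contains two if conditions, _N times. The opposite case, with few treated observations, means that you go through the full body of the first loop only a few times, which speeds up your code considerably.

The reason my code can maintain its efficiency is the use of in, which always offers a speed gain over if. Stata goes directly to those observations, with no logical check needed. Your problem offers an opportunity for that substitution, and it is wise to take it.

If my code used if instead of in, the results would be different. Your code would be faster in the case with a large number of untreated relative to treated observations, because your code does not need to go through the complete loop and only little work is required; the first loop short-circuits at the first if. In the opposite case, my code would still dominate.

The key is to "separate" the treated from the untreated and to address each group using in.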
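A rough way to see the difference between the two qualifiers is to time the same command with each. The sketch below is only an illustration on made-up data (100,000 observations, the first 20,000 untreated; the variable x and the repetition count are arbitrary), and the actual timings will depend on the machine and Stata version.

```stata
clear
set obs 100000
set seed 32956
gen byte treat = _n > 20000   // rows 1/20000 are untreated (treat == 0)
gen double x = runiform()

timer clear

// if: the condition is checked for every observation in the data
timer on 1
forvalues k = 1/200 {
    quietly summarize x if treat == 0, meanonly
}
timer off 1

// in: only the requested range of observations is visited
timer on 2
forvalues k = 1/200 {
    quietly summarize x in 1/20000, meanonly
}
timer off 2

timer list
```

Both loops summarize the same 20,000 observations; only the way Stata selects them differs.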
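The code above relies on ids being consecutive and the data being sorted by treatment. A variation that drops the first assumption is sketched below on the example data from the question. It is a sketch under stated assumptions, not a tested general solution: it assumes at least one untreated and one treated observation and no missing pscore, and the helper variable cand and the locals nc and firstt are names chosen for illustration. Ties are resolved in favour of the lowest id.

```stata
clear
input id pscore treatment
1 0.14 0
2 0.32 0
3 0.465 1
4 0.878 1
5 0.912 1
6 0.001 1
end

// put the untreated first, so they occupy rows 1/`nc'
sort treatment id
quietly count if treatment == 0
local nc = r(N)
local firstt = `nc' + 1

gen match = .
gen double dif = .
gen long cand = .

quietly forvalues i = `firstt'/`=_N' {

    // squared distance from treated row `i' to every untreated row
    replace dif = (pscore[`i'] - pscore)^2 in 1/`nc'
    summarize dif in 1/`nc', meanonly

    // ids of the untreated rows at minimum distance
    replace cand = cond(dif == r(min), id, .) in 1/`nc'
    summarize cand in 1/`nc', meanonly

    // store the lowest id among the closest untreated rows
    replace match = r(min) in `i'
}

drop dif cand
sort id
list, noobs
```

Because both groups are addressed by row range, every replace and summarize inside the loop uses in rather than if.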

Nearest neighbor matching in Stata
