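I have three variables (ID, Name and City) and need to generate a new variable, flag.
Some observations contain errors. I need to find the bad observations and create the flag, which indicates which column holds the bad value (0 = no error, 1 = ID, 2 = Name, 3 = City).
Assume each row contains at most one bad value.
Given this dirty data: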
|ID |Name |City
|1 |IBM |D
|1 |IBM |D
|2 |IBM |D
|3 |Google |F
|3 |Microsoft |F
|3 |Google |F
|8 |Microsoft |A
|8 |Microsoft |B
|8 |Microsoft |A
Expected result:
|ID |Name |City |flag
|1 |IBM |D |0
|1 |IBM |D |0
|2 |IBM |D |1
|3 |Google |F |0
|3 |Microsoft |F |2
|3 |Google |F |0
|8 |Microsoft |A |0
|8 |Microsoft |B |3
|8 |Microsoft |A |0
Answer 0 (score: 3)
Here is a Stata answer. It relies on several assumptions that you stated in the comments but not in the original question:
clear all
input float ID str9 Name str1 City
1 "IBM" "D"
1 "IBM" "D"
2 "IBM" "D"
3 "Google" "F"
3 "Microsoft" "F"
3 "Google" "F"
8 "Microsoft" "A"
8 "Microsoft" "B"
8 "Microsoft" "A"
end
// tag exact duplicates: right counts the other rows identical to this one,
// so right == 0 marks a row with no exact copy
duplicates tag, gen(right)
gen flag = 0
encode Name, gen(Name_n)
encode City, gen(City_n)
// walk through the data in blocks of three consecutive rows
forvalues start = 1(3)`=_N' {
    local end = `start' + 2
    // check whether ID is the same in all three rows
    qui sum ID in `start'/`end'
    if `r(sd)' != 0 {
        replace flag = 1 if right == 0 in `start'/`end'
        continue
    }
    // check whether Name is the same in all three rows
    qui sum Name_n in `start'/`end'
    if `r(sd)' != 0 {
        replace flag = 2 if right == 0 in `start'/`end'
        continue
    }
    // check whether City is the same in all three rows
    qui sum City_n in `start'/`end'
    if `r(sd)' != 0 {
        replace flag = 3 if right == 0 in `start'/`end'
        continue
    }
}
drop right Name_n City_n
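The intuition: the observations come in groups of three, two of the three are always correct, and each group has only one problem. The data are sorted by ID, and an ID may be wrong but is never larger than the next-largest correct ID. So we can check for duplicates first: if an observation has an exact duplicate, that observation is correct.
Next, in the forvalues loop, we go through each group of three to see which variable has the wrong value. When we find it, we replace flag with the corresponding number.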
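To look at the first step in isolation, here is a minimal sketch that runs only `duplicates tag` on the sample data from the question. It assumes, as the answer does, that every correct row has an exact copy. The variable name `right` follows the code above; nothing else is introduced.

clear all
input float ID str9 Name str1 City
1 "IBM" "D"
1 "IBM" "D"
2 "IBM" "D"
3 "Google" "F"
3 "Microsoft" "F"
3 "Google" "F"
8 "Microsoft" "A"
8 "Microsoft" "B"
8 "Microsoft" "A"
end
// right = number of other rows identical on ID, Name and City
duplicates tag, gen(right)
// rows without an exact copy are the candidates for a bad value
list if right == 0
// in this sample exactly one row per group of three has no copy
count if right == 0
assert r(N) == 3

The loop then only has to decide which column differs within each block, because the rows with `right == 0` are already known to be the bad ones.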
Answer 1 (score: 2)
This code is based on Eric's answer.
clear all
input float ID str9 Name str1 City
1 "IBM" "D"
1 "IBM" "D"
2 "IBM" "D"
3 "Google" "F"
3 "Microsoft" "F"
3 "Google" "F"
8 "Microsoft" "A"
8 "Microsoft" "B"
8 "Microsoft" "A"
end
encode Name, gen(Name_n)
encode City, gen(City_n)
// count duplicates on each pair of columns and on all three columns
duplicates tag ID Name, gen(col_12)
duplicates tag ID City, gen(col_13)
duplicates tag Name City, gen(col_23)
duplicates tag ID Name City, gen(col_123)
// generate the flag
gen flag = 0
replace flag = 1 if col_123 == 0 & col_23 ~= 0
replace flag = 2 if col_123 == 0 & col_13 ~= 0
replace flag = 3 if col_123 == 0 & col_12 ~= 0
drop Name_n City_n col_*
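One way to check this approach against the result table in the question is sketched below. The column `expected` is a hypothetical helper added here for illustration, typed in from the question's expected flags. The idea is that a row with a bad value in one column still matches other rows on the remaining two columns, so the pair that still has duplicates identifies the bad column.

clear all
input float ID str9 Name str1 City byte expected
1 "IBM" "D" 0
1 "IBM" "D" 0
2 "IBM" "D" 1
3 "Google" "F" 0
3 "Microsoft" "F" 2
3 "Google" "F" 0
8 "Microsoft" "A" 0
8 "Microsoft" "B" 3
8 "Microsoft" "A" 0
end
// duplicates on each pair of columns and on all three columns
duplicates tag ID Name, gen(col_12)
duplicates tag ID City, gen(col_13)
duplicates tag Name City, gen(col_23)
duplicates tag ID Name City, gen(col_123)
// a row unique on all three columns is flagged by the pair it still matches on
gen flag = 0
replace flag = 1 if col_123 == 0 & col_23 != 0
replace flag = 2 if col_123 == 0 & col_13 != 0
replace flag = 3 if col_123 == 0 & col_12 != 0
// assert stops with an error if any row disagrees with the expected flag
assert flag == expected
drop col_* expected

Unlike the loop in the first answer, this version does not depend on the rows arriving in blocks of three, but it still assumes that each bad row matches at least one other row on its two correct columns.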