Efficient way to categorize

Date: 2015-07-11 15:26:49

Tags: stata

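I have a string variable, call it desc, that takes many distinct values, say 300. I want to create two new variables, desc_a and desc_b. desc contains two classes of values; I want the values belonging to the first class to go into desc_a and the rest into desc_b. Below is the method I came up with. However, it is very slow, and I would like to know whether there is a better way to do this.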

gen desc_a = ""
gen desc_b = ""
tab desc

The resulting tab output might look like this (irrelevant information omitted):

DESC                  |  Freq.  Perc.  Cum.
___________________________________________
First Element of a       53
Second Element of a      22
First Element of b       78
Third Element of a       232
Second Element of b      33

* Go through the output manually and copy-paste each string from the table into the following statements:

replace desc_a = "First Element of a" if desc=="First Element of a"
replace desc_a = "Second Element of a" if desc=="Second Element of a"
replace desc_a = "Third Element of a" if desc=="Third Element of a"
...
replace desc_b = "First Element of b" if desc=="First Element of b"
replace desc_b = "Second Element of b" if desc=="Second Element of b"

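Note that the real data do not follow such a neat pattern, so I cannot automate this with regular expressions or anything similar. I really do need to inspect each value by hand and decide which class it belongs to. Still, I suspect that the copy-and-paste-heavy approach described above is not the best way to do it.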

3 answers:

Answer 0 (score: 1)

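The Stata Data Editor window can reduce the amount of work.

Create a Stata dataset with two variables: the 300 distinct values of desc, and a variable, which I will call ab, initialized to missing. Open that dataset in the Stata Data Editor and go through the observations, replacing the missing value (by typing into the cell) with an indicator of whether the description belongs to group a or b (say, 1 or 2). Then save that dataset, merge it with the original dataset, and use the merged value of ab to assign each description to the appropriate variable.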

generate desc_a = desc if ab==1
generate desc_b = desc if ab==2
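
The whole workflow described in this answer could be sketched roughly as follows. This is only an illustration under assumptions: the toy data, the tempfile names main and lookup, and the use of contract to get one row per distinct value are my own choices, and the replace lines merely stand in for the typing that would really be done in the Data Editor.

```stata
* Sketch only: toy data, tempfile names and the 1/2 coding are assumptions
clear
input str19 desc
"First Element of a"
"Second Element of a"
"First Element of b"
"Third Element of a"
"Second Element of b"
"First Element of a"
end
tempfile main lookup
save "`main'"

* one row per distinct value of desc, plus an empty classification column
contract desc
drop _freq
generate byte ab = .

* here one would run -edit desc ab- and type 1 or 2 into each cell;
* the two replace lines below only simulate that manual step
replace ab = 1 if inlist(desc, "First Element of a", "Second Element of a", "Third Element of a")
replace ab = 2 if missing(ab)
save "`lookup'"

* attach the classification to every observation of the original data
use "`main'", clear
merge m:1 desc using "`lookup'", assert(match) nogenerate
generate desc_a = desc if ab == 1
generate desc_b = desc if ab == 2
```

With this layout the manual work is one keystroke per distinct value of desc rather than one pasted replace statement per value.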

Answer 1 (score: 1)

Expanding on @William's solution:

* recreate your data example
clear
input str19 desc int n
"First Element of a" 53 
"Second Element of a" 22 
"First Element of b " 78 
"Third Element of a" 232 
"Second Element of b" 33 
end
expand n
set seed 314324
gen somedata = runiform()
sort somedata
tab desc
tempfile main
save "`main'"

* reduce to one observation per value of desc
bysort desc: keep if _n == 1
keep desc

* make an effort to identify a or b, note that
* the following fails for one obs
gen ab = regexs(1) if regexm(desc,"(a|b)$")

* save and edit manually
tempfile toedit
save "`toedit'"

* this is simulated editing...
clear
input str19 desc str1 ab
"First Element of a" "a" 
"First Element of b " "b" 
"Second Element of a" "a" 
"Second Element of b" "b" 
"Third Element of a" "a" 
end

* now combine with the original data
merge 1:m desc using "`main'", assert(match) nogen
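
The answer stops at the merge. A possible last step, assuming ab is the string variable holding "a" or "b" as in the simulated edit above, might look like the sketch below. The small stand-in dataset is only there so the snippet runs on its own; after the real merge, only the two generate lines would be needed.

```stata
* Stand-in data so the sketch runs alone; after the merge above,
* desc and ab would already be in memory
clear
input str19 desc str1 ab
"First Element of a" "a"
"First Element of b" "b"
"Third Element of a" "a"
end

* split desc into the two target variables based on the edited classification
generate desc_a = desc if ab == "a"
generate desc_b = desc if ab == "b"

* every observation should land in exactly one of the two variables
assert (desc_a != "") + (desc_b != "") == 1
```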

Answer 2 (score: 0)

This is not the best, but it improves on my solution above:

gen desc_a = ""
gen desc_b = ""
replace desc_a = desc if desc=="First Element of a"
replace desc_a = desc if desc=="Second Element of a"
replace desc_a = desc if desc=="Third Element of a"
...

replace desc_b = desc if desc_a==""
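
A slightly more compact variant of the same idea, offered only as a sketch: inlist() can test several values in one statement, so each group of class a strings needs one line rather than one line per string. The toy data are an assumption. As far as I know, inlist() accepts only a limited number of string arguments (10 including the variable, according to the function documentation I am aware of), so a long list would have to be split across several inlist() calls joined with |.

```stata
* Toy data standing in for the real desc variable (an assumption)
clear
input str19 desc
"First Element of a"
"Second Element of a"
"First Element of b"
"Third Element of a"
"Second Element of b"
end

* one statement covers several class a values at once
generate desc_a = desc if inlist(desc, "First Element of a", "Second Element of a", "Third Element of a")

* everything not assigned to desc_a goes to desc_b
generate desc_b = desc if missing(desc_a)
```

The strings still have to be pasted once, but only for one of the two classes, and without repeating the replace boilerplate.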