Question

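In Stata, I have data on people who work together on projects. Each row is a project, and the columns are person_1 through person_20; if a name appears in a column, that person worked on the project in that row. A group can have 1 person, 2 people, ..., up to 20 people. I have a binary variable (yes = 1) for each possibility: group of 1 (G1), G2, ..., G11. I then build the group name with this code (using a 4-person group as the example):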

generate project_group = person_1 + "/" + person_2 + "/" + person_3 + "/" + person_4 if G4 == 1
This yields: Tom/Joe/Mike/Sally

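I have three questions:

1) Is there a more efficient way to do the grouping? For example, code that looks at a single project (one row), counts how many people are on it (how many fields are non-empty), and then creates a unique group name with each person's name separated by "/". I am fine with the code I wrote, but my dataset will change size, so more efficient code would probably be best.

2) Using my example, how do I treat Joe/Tom/Mike/Sally and Sally/Joe/Mike/Time as the same group? I want every group, regardless of size, to list its members alphabetically. In my example, the list would be Joe/Mike/Sally/Tom regardless of the actual ordering.

3) How do I create a unique group based on the first person (if they are the project leader, they are the first name listed)? So Joe/Tom/Mike and Joe/Mike/Tom are the same group, but Tom/Joe/Mike and Mike/Tom/Joe are not.

Thanks for your help and suggestions.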

Answer 1

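I reworked my approach to make it clearer. You mentioned that you cannot recode your variables, but I am not sure there is a way around that (I think any solution here recodes, explicitly or implicitly). Of course, you will need to replace "4" with "20" throughout.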

* generate some projects and members
clear
set obs 5
generate int project = _n
generate person_1 = "Tom"
generate person_2 = "Dick" if (_n >= 3)
generate person_3 = "Harry" if (_n >=5)
replace person_1 = "Jane" if inlist(_n, 2, 4)
tempfile orig
save `orig'

* reshape to long
reshape long person_, i(project) string
drop _j
drop if missing(person_)
sort project person_
* one numeric id per distinct name, assigned in alphabetical order
egen id = group(person_)
drop if missing(id)
reshape wide person_, i(project) j(id)

* recode to allow easier group identification
forvalues i = 1/4 {
    levelsof person_`i', local(name) clean
    generate byte d_person_`i' = cond(missing(person_`i'), 0, 1)
    label define d_person_`i'_lbl 1 "`name'" 0 ""
    label values d_person_`i' d_person_`i'_lbl
}

* determine number of workers on project
egen gp_size = rowtotal(d_person_*)

* unique id for each group composition
generate int id = 0
forvalues i = 1/4 {
    local two_i = 2^(`i' - 1)
    replace id = id + d_person_`i' * `two_i'
}

* group members
generate str mbrs = ""
forvalues i = 1/4 {
    local name: label d_person_`i'_lbl 1
    replace mbrs = mbrs + "/" + "`name'" if (d_person_`i' == 1)
}   

* there's always a leading "/" to remove with this approach
replace mbrs = substr(mbrs, 2, .)

* merge back your orig data
merge 1:1 project using `orig', nogenerate replace update

This yields:

. list

     +---------------------------------------------------------------------------------------------------------------------------------+
     | project   person_1   person_2   person_3   person_4   d_pers~1   d_pers~2   d_pers~3   d_pers~4   gp_size   id             mbrs |
     |---------------------------------------------------------------------------------------------------------------------------------|
  1. |       1        Tom                              Tom                                         Tom         1    8              Tom |
  2. |       2       Jane                  Jane                                        Jane                    1    4             Jane |
  3. |       3        Tom       Dick                   Tom       Dick                              Tom         2    9         Dick/Tom |
  4. |       4       Jane       Dick       Jane                  Dick                  Jane                    2    5        Dick/Jane |
  5. |       5        Tom       Dick      Harry        Tom       Dick      Harry                   Tom         3   11   Dick/Harry/Tom |
     +---------------------------------------------------------------------------------------------------------------------------------+
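If only the alphabetical member string is needed, the same reshape idea can be shortened. The following is a minimal sketch, not the answer's own method: it assumes `project` uniquely identifies rows and that the person_* variables are strings that are empty when unused. The names `slot`, `mbrs_sorted`, and `sorted` are hypothetical.

```stata
* Sketch: alphabetical, slash-separated member list per project.
preserve
keep project person_*
reshape long person_, i(project) j(slot)
drop if person_ == ""
sort project person_
* build the list cumulatively within each project
by project: generate mbrs_sorted = person_ if _n == 1
by project: replace mbrs_sorted = mbrs_sorted[_n-1] + "/" + person_ if _n > 1
* the last row of each project holds the complete list
by project: keep if _n == _N
keep project mbrs_sorted
tempfile sorted
save `sorted'
restore
merge 1:1 project using `sorted', nogenerate
```

Because the names are sorted before they are concatenated, two projects with the same members end up with the same `mbrs_sorted` value whatever the original column order was. Projects with no members at all keep an empty `mbrs_sorted` after the merge.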

Answer 2

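1) Is there a more efficient way to do the grouping?

I'm not sure I understand what is wrong with your current arrangement, which looks clean and readable.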
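That said, if the number of person columns changes between datasets, a loop avoids writing one line per group size. A minimal sketch, assuming the person_* variables are strings that are empty ("") when unused; `n_members` is a hypothetical name:

```stata
* Sketch: build project_group in one pass, without the G1, G2, ... indicators.
generate project_group = ""
generate byte n_members = 0
foreach v of varlist person_* {
    * append the name, adding "/" only when something is already there
    replace project_group = project_group + cond(project_group == "", "", "/") + `v' if `v' != ""
    * count the non-empty fields in the row
    replace n_members = n_members + (`v' != "")
}
```

The `person_*` wildcard picks up however many person columns exist, so the same code should work for 4 or 20 columns. Names are joined in column order, not alphabetically.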

2) Using my example, how do I treat Joe/Tom/Mike/Sally and Sally/Joe/Mike/Time as the same group?

I assume you meant Tom rather than Time in your last string.

* count the non-empty person_* fields in each row (strok allows string variables)
egen team_size = rownonmiss(person_1-person_20), strok
* the leader is the first listed person
generate team_leader = person_1 if team_size > 0
* one decimal digit per possible member: 1 if the name appears, 0 otherwise
gen team_structure = 0
replace team_structure = team_structure + regexm(project_group,"Joe")
replace team_structure = team_structure + regexm(project_group,"Tom")*10
replace team_structure = team_structure + regexm(project_group,"Mike")*100
replace team_structure = team_structure + regexm(project_group,"Sally")*1000

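team_structure is a $k$-digit number made of 0s and 1s that encodes which of the $k$ possible members are on the team, regardless of the order in which they appear in the project_group string. The code gets long if you have many members, but it is easy to write.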
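To see that the encoding ignores order, here is a small self-contained sketch on made-up rows. It assumes the same four names and uses a loop with a hypothetical local `w` for the digit weight, which is a convenience on top of the code above.

```stata
* Sketch: same members in a different order give the same team_structure.
clear
input str30 project_group
"Joe/Tom/Mike/Sally"
"Sally/Joe/Mike/Tom"
"Joe/Mike/Sally"
end
generate double team_structure = 0
local w = 1
foreach name in Joe Tom Mike Sally {
    replace team_structure = team_structure + regexm(project_group, "`name'") * `w'
    local w = `w' * 10
}
list
```

The first two rows both get 1111, and the third gets 1101 (Joe = 1, Mike = 100, Sally = 1000). Two caveats: regexm matches substrings, so "Joe" would also match "Joey", and with many members the decimal weights grow past what a numeric variable stores exactly, so a string key or the reshape approach may be safer there.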

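3) How do I create a unique group based on the first person (if they are the project leader, they are the first name listed)? So Joe/Tom/Mike and Joe/Mike/Tom are the same group, but Tom/Joe/Mike and Mike/Tom/Joe are not.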

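A simple suggestion: give each possible member a numeric code (encode) and put the leader's code in the decimal part of the team_structure variable created above. For example, if Joe is coded 1 and Sally is coded 4, then 1101.1 is the Joe/Mike/Sally group led by Joe, 1101.4 is the same group led by Sally, and so on.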
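A variation on this idea, swapped in here to avoid comparing decimal values, is to let egen number each leader and membership combination. This is only a sketch: it assumes person_1 is a string variable holding the leader and that team_structure already exists; `leader_code` and `led_team` are hypothetical names.

```stata
* Sketch: one id per (leader, membership) pair.
* encode assigns integer codes to the leader names in alphabetical order
encode person_1, generate(leader_code)
* rows share led_team only if both the leader and the member set match
egen led_team = group(leader_code team_structure)
```

With this, Joe/Tom/Mike and Joe/Mike/Tom fall in the same led_team, while Tom/Joe/Mike gets a different one because its leader differs.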
