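I have a database in which many people hold (possibly) multiple subscriptions to one running service, with transaction data for each event during a subscription. I'm trying to create a variable that counts how many subscriptions a user has active at the time of a given transaction.
To illustrate, my data take the following form: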
person | subscription | obs_date | sub_start_date | sub_end_date | num_concurrent_subs
--------------------------------------------------------------------------------------
1 | 1 | 09/01/10 | 09/01/10 | 09/01/11 | 1
1 | 1 | 10/01/10 | 09/01/10 | 09/01/11 | 2
1 | 1 | 11/01/10 | 09/01/10 | 09/01/11 | 2
1 | 2 | 10/01/10 | 10/01/10 | 09/01/11 | 2
1 | 2 | 11/01/10 | 10/01/10 | 09/01/11 | 2
1 | 3 | 11/01/14 | 09/01/14 | . | 1
1 | 3 | 11/01/16 | 09/01/14 | . | 1
1 | 4 | 11/01/15 | 10/01/15 | 11/01/15 | 3
1 | 5 | 11/01/15 | 10/01/15 | 11/01/15 | 3
and so on for each person. I'd like to generate num_concurrent_subs as shown above.
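That is, for each person, look at each observation and count how many of that person's subscriptions have a sub_start_date to sub_end_date range that contains the observation date.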
I've looked at Stata's count function and believe I'm close to a solution, but I'm not sure how to make it look across different subscriptions.
Answer 0 (score: 1)
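You can do this by separating the subscription information from the transaction data and converting the subscription data to long form, with one observation for the start date and one for the end date. You then recombine the transaction data with these subscription events and order everything by a single date variable. An onoff variable tracks the start and end of each subscription. Something like: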
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(person subscription) str8(obs_date sub_start_date sub_end_date) byte num_concurrent_subs
1 1 "09/01/10" "09/01/10" "09/01/11" 1
1 1 "10/01/10" "09/01/10" "09/01/11" 2
1 1 "11/01/10" "09/01/10" "09/01/11" 2
1 2 "10/01/10" "10/01/10" "09/01/11" 2
1 2 "11/01/10" "10/01/10" "09/01/11" 2
1 3 "11/01/14" "09/01/14" "." 1
1 3 "11/01/16" "09/01/14" "." 1
1 4 "11/01/15" "10/01/15" "11/01/15" 3
1 5 "11/01/15" "10/01/15" "11/01/15" 3
end
* should always have an observation identifier
gen obsid = _n
* convert string to Stata numeric dates
gen odate = daily(obs_date,"MD20Y")
gen substart = daily(sub_start_date,"MD20Y")
gen subend = daily(sub_end_date,"MD20Y")
format %td odate substart subend
save "main_data.dta", replace
* reduce to subscription info with one obs for the start and one obs
* for the end of each subscription. use an onoff variable to track
* start and end events
keep person subscription substart subend
bysort person subscription substart subend: keep if _n == 1
expand 2
bysort person subscription: gen adate = cond(_n == 1, substart, subend)
by person subscription: gen onoff = cond(_n == 1, 1, -1)
replace onoff = 0 if mi(adate)
format %td adate
append using "main_data.dta"
* include obs date in adate and nothing happens on the observation date
replace adate = odate if !mi(obsid)
replace onoff = 0 if !mi(obsid)
* order by person adate, put on event first, then obs events, then off events
gsort person adate -onoff
by person: gen concur = sum(onoff)
* return to original obs
keep if !mi(obsid)
sort obsid
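To see why the running sum gives the count, here is a minimal sketch on a hypothetical toy timeline for a single person: two subscriptions and two observation dates. The values and the variable name adate_str are made up for illustration and are not part of the data above. The sketch shows only the ordering and cumulative-sum step.

```stata
* Toy timeline (hypothetical values): +1 = subscription starts,
* -1 = subscription ends, 0 = a transaction observation
clear
input str8 adate_str byte onoff
"10/01/10"  0
"09/01/10"  1
"10/01/10"  1
"11/01/10"  0
"09/01/11" -1
"09/01/11" -1
end
gen adate = daily(adate_str, "MD20Y")
format %td adate

* within a date: on events first, then observations, then off events,
* so a subscription counts as active on both its start and its end date
gsort adate -onoff

* running total of on/off events = subscriptions active at each row
gen concur = sum(onoff)
list adate onoff concur, noobs sepby(adate)
```

In this toy data, both rows with onoff == 0 fall after the two starts and before the two ends, so each should show a count of 2. Because the example data in the answer already carry the expected num_concurrent_subs column, running assert concur == num_concurrent_subs at the end of the full code above is one way to check the result.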
Answer 1 (score: 1)
Here is another way to do it, using rangejoin (from SSC). To install it, type in Stata's Command window:
ssc install rangejoin
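With rangejoin, you can pair each subscription with all transaction observations whose dates fall within that subscription's start and end dates. Then, for each transaction observation, it is just a matter of counting the number of subscriptions paired with it.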
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(person subscription) str8(obs_date sub_start_date sub_end_date) byte num_concurrent_subs
1 1 "09/01/10" "09/01/10" "09/01/11" 1
1 1 "10/01/10" "09/01/10" "09/01/11" 2
1 1 "11/01/10" "09/01/10" "09/01/11" 2
1 2 "10/01/10" "10/01/10" "09/01/11" 2
1 2 "11/01/10" "10/01/10" "09/01/11" 2
1 3 "11/01/14" "09/01/14" "." 1
1 3 "11/01/16" "09/01/14" "." 1
1 4 "11/01/15" "10/01/15" "11/01/15" 3
1 5 "11/01/15" "10/01/15" "11/01/15" 3
end
* should always have an observation identifier
gen obsid = _n
* convert string to Stata numeric dates
gen odate = daily(obs_date,"MD20Y")
gen substart = daily(sub_start_date,"MD20Y")
gen subend = daily(sub_end_date,"MD20Y")
format %td odate substart subend
save "main_data.dta", replace
* reduce to subscription start and end date per person
bysort person subscription substart subend: keep if _n == 1
keep person substart subend
* missing values will exclude obs so use a date in the future
replace subend = mdy(1,1,2099) if mi(subend)
* pair each subscription with an obs date
rangejoin odate substart subend using "main_data.dta", by(person)
* the number of current subscriptions is the number of pairings
bysort obsid: gen current = _N
* return to original obs
by obsid: keep if _n == 1
sort obsid
drop substart subend
rename (substart_U subend_U) (substart subend)
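For readers who want to see the pairing logic spelled out, the same idea can be sketched with Stata's built-in joinby instead of rangejoin. This is a swapped-in technique, not what the answer uses. It forms every subscription-by-observation pair within a person and then keeps a count of the pairs whose observation date falls inside the subscription window. The data below are a reduced, illustrative subset of the question's example, and the names obs, active, and current are assumptions made for this sketch.

```stata
* Sketch only: all-pairs join with -joinby-, then filter by date range
clear
input byte(person subscription) str8(obs_date sub_start_date sub_end_date)
1 1 "09/01/10" "09/01/10" "09/01/11"
1 1 "10/01/10" "09/01/10" "09/01/11"
1 2 "10/01/10" "10/01/10" "09/01/11"
1 3 "11/01/14" "09/01/14" "."
end
gen obsid    = _n
gen odate    = daily(obs_date, "MD20Y")
gen substart = daily(sub_start_date, "MD20Y")
gen subend   = daily(sub_end_date, "MD20Y")
format %td odate substart subend

* transaction side: one row per observation
tempfile obs
preserve
keep person obsid odate
save `obs'
restore

* subscription side: one row per subscription
keep person subscription substart subend
duplicates drop

* every subscription paired with every observation of the same person
joinby person using `obs'

* a missing end date is treated as "still active"
gen byte active = odate >= substart & (odate <= subend | mi(subend))

* number of active subscriptions per original observation
collapse (sum) current = active, by(obsid)
list, noobs
```

Because joinby builds all pairs before filtering, the intermediate dataset can get large when people have many subscriptions and many transactions. rangejoin avoids this by pairing only the observations that fall within each range.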