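I am trying to analyze the relationship between the probability of a call and the distance of a vehicle.
The example dataset (here as CSV) looks like this: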
id day time called d
1 2009-06-24 1700 0 1037.6
1 2009-06-24 1710 1 1191.9
1 2009-06-24 1720 0 165.5
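The real dataset has 10 million rows. Each id represents a location that either called or did not call in each time window (here, 10 minutes).
First, I want to drop all rows for an id that never called at a given time on any day of the whole period.
That leaves me with the rows for ids that called, at the given time, on at least one day of the analysis period.
I want to create a variable that has the value 0 in the row where the call occurred; on the day before (or hour, week, month, whatever, but here day) at the same time it equals -1, on the day after it equals +1, and so on. Later I will use that variable together with called and distance as input for analysis and comparison across locations.
I have looked for other answered questions but did not find anything suitable, so an answer or a pointer to one would be much appreciated. I am using Stata 13, but a solution in Postgres 9.3 or R would also be welcome.
I need to repeat this procedure many times for multiple datasets, so ideally I would like to automate it as much as possible.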
Update:
Here is an example of the desired result:
id day time called d newvar newvar2
1 2009-06-24 1700 0 1037.6 null
1 2009-06-24 1710 1 1191.9 0 -2
1 2009-06-24 1720 0 165.5 -1
1 2009-06-25 1700 0 526.7 null
1 2009-06-25 1710 0 342.5 1 -1
1 2009-06-25 1720 1 416.1 0
1 2009-06-26 1700 0 428.3 null
1 2009-06-26 1710 1 240.7 2 0
1 2009-06-26 1720 0 228.7 1
1 2009-06-27 1700 0 282.5 null
1 2009-06-27 1710 0 182.1 3 1
1 2009-06-27 1720 0 195.5 2
2 2009-06-24 1700 0 198.0 -1
2 2009-06-24 1710 0 157.4 null
2 2009-06-24 1720 0 234.9 null
2 2009-06-25 1700 1 247.0 0
I added newvar2 because some locations may call more than once in a given time window.
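Answer 0 (score: 2)
When looking for a Stata solution, it is best to provide a data example using dataex (from SSC).
It is hard to see the problem until the data are sorted by id and time (and then by day). I did not convert the day variable to a Stata numeric date because, as constructed, the string sort order matches the natural date order.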
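If the day strings were stored in another layout, or if gaps between days had to be measured, a numeric date would be needed. Here is a minimal sketch of that conversion, assuming the strings are always in YYYY-MM-DD form. The name ddate is introduced here for illustration and is not used elsewhere in this answer.

```stata
* Sketch only: convert the string day to a Stata daily date
clear
input byte id str10 day int time byte called
1 "2009-06-24" 1710 1
1 "2009-06-25" 1710 0
1 "2009-06-26" 1710 1
1 "2009-06-27" 1710 0
end

* ddate is a hypothetical name for the numeric daily date
gen long ddate = date(day, "YMD")
format %td ddate

* check that sorting on the string also orders the numeric dates
sort id time day
by id time: assert ddate > ddate[_n-1] if _n > 1

list, noobs
```

With a numeric date, an offset could be computed as a difference of dates rather than of observation numbers, which would matter only if some days were missing from a group.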
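For each call within an id time group, you seem to want the day offset relative to the day of the call. This can be done by generating an order variable that tracks the index of the current observation within each id time group, and then subtracting the index of the observation where the call was made.
Since there can be more than one call per time slot, you must loop up to the maximum number of calls found in any time slot in the data.
This solution produces slightly different results from yours: you seem to have overlooked the call for id == 2 at 1710 on 2009-06-27.
In the example below, the original data are sorted by id time day to give readers a better sense of what is going on.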
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str10 day int time byte called float distance str4 newvar byte newvar2
1 "2009-06-24" 1700 0 1037.6 "null" .
1 "2009-06-25" 1700 0 526.7 "null" .
1 "2009-06-26" 1700 0 428.3 "null" .
1 "2009-06-27" 1700 0 282.5 "null" .
1 "2009-06-24" 1710 1 1191.9 "0" -2
1 "2009-06-25" 1710 0 342.5 "1" -1
1 "2009-06-26" 1710 1 240.7 "2" 0
1 "2009-06-27" 1710 0 182.1 "3" 1
1 "2009-06-24" 1720 0 165.5 "-1" .
1 "2009-06-25" 1720 1 416.1 "0" .
1 "2009-06-26" 1720 0 228.7 "1" .
1 "2009-06-27" 1720 0 195.5 "2" .
2 "2009-06-24" 1700 0 198 "-1" .
2 "2009-06-25" 1700 1 247 "0" .
2 "2009-06-26" 1700 0 188.7 "1" .
2 "2009-06-27" 1700 0 203.5 "2" .
2 "2009-06-24" 1710 0 157.4 "null" .
2 "2009-06-25" 1710 0 221.3 "null" .
2 "2009-06-26" 1710 0 283.8 "null" .
2 "2009-06-27" 1710 1 91.7 "null" .
2 "2009-06-24" 1720 0 234.9 "null" .
2 "2009-06-25" 1720 0 249.6 "null" .
2 "2009-06-26" 1720 0 279.7 "null" .
2 "2009-06-27" 1720 0 198.2 "null" .
3 "2009-06-24" 1700 0 156.1 "-1" .
3 "2009-06-25" 1700 1 19.9 "0" .
3 "2009-06-26" 1700 0 195.2 "1" .
3 "2009-06-27" 1700 0 306.2 "2" .
3 "2009-06-24" 1710 0 150.1 "null" .
3 "2009-06-25" 1710 0 163.7 "null" .
3 "2009-06-26" 1710 0 288.2 "null" .
3 "2009-06-27" 1710 0 311.7 "null" .
3 "2009-06-24" 1720 0 135.1 "-2" .
3 "2009-06-25" 1720 0 186 "-1" .
3 "2009-06-26" 1720 1 297.2 "0" .
3 "2009-06-27" 1720 0 375.9 "1" .
end
* order observations by date within an id time group
sort id time day
by id time: gen order = _n
* number of calls at any given time
by id time: gen call = sum(called)
* repeat enough to cover the max number of calls per time
sum call, meanonly
local n = r(max)
forvalues i = 1/`n' {
    // the index of the called observation in the id time group
    by id time: gen index = order if called & call == `i'
    // replicate the index for all observations in the id time group
    by id time: egen gindex = total(index)
    // the relative position of each obs in groups with a call
    gen wanted`i' = order - gindex if gindex > 0
    drop index gindex
}
list, sepby(id time) noobs compress
and the results:
. list, sepby(id time) noobs compress
+----------------------------------------------------------------------------------------+
| id day time cal~d dist~e new~r new~2 order call wan~1 wan~2 |
|----------------------------------------------------------------------------------------|
| 1 2009-06-24 1700 0 1037.6 null . 1 0 . . |
| 1 2009-06-25 1700 0 526.7 null . 2 0 . . |
| 1 2009-06-26 1700 0 428.3 null . 3 0 . . |
| 1 2009-06-27 1700 0 282.5 null . 4 0 . . |
|----------------------------------------------------------------------------------------|
| 1 2009-06-24 1710 1 1191.9 0 -2 1 1 0 -2 |
| 1 2009-06-25 1710 0 342.5 1 -1 2 1 1 -1 |
| 1 2009-06-26 1710 1 240.7 2 0 3 2 2 0 |
| 1 2009-06-27 1710 0 182.1 3 1 4 2 3 1 |
|----------------------------------------------------------------------------------------|
| 1 2009-06-24 1720 0 165.5 -1 . 1 0 -1 . |
| 1 2009-06-25 1720 1 416.1 0 . 2 1 0 . |
| 1 2009-06-26 1720 0 228.7 1 . 3 1 1 . |
| 1 2009-06-27 1720 0 195.5 2 . 4 1 2 . |
|----------------------------------------------------------------------------------------|
| 2 2009-06-24 1700 0 198 -1 . 1 0 -1 . |
| 2 2009-06-25 1700 1 247 0 . 2 1 0 . |
| 2 2009-06-26 1700 0 188.7 1 . 3 1 1 . |
| 2 2009-06-27 1700 0 203.5 2 . 4 1 2 . |
|----------------------------------------------------------------------------------------|
| 2 2009-06-24 1710 0 157.4 null . 1 0 -3 . |
| 2 2009-06-25 1710 0 221.3 null . 2 0 -2 . |
| 2 2009-06-26 1710 0 283.8 null . 3 0 -1 . |
| 2 2009-06-27 1710 1 91.7 null . 4 1 0 . |
|----------------------------------------------------------------------------------------|
| 2 2009-06-24 1720 0 234.9 null . 1 0 . . |
| 2 2009-06-25 1720 0 249.6 null . 2 0 . . |
| 2 2009-06-26 1720 0 279.7 null . 3 0 . . |
| 2 2009-06-27 1720 0 198.2 null . 4 0 . . |
|----------------------------------------------------------------------------------------|
| 3 2009-06-24 1700 0 156.1 -1 . 1 0 -1 . |
| 3 2009-06-25 1700 1 19.9 0 . 2 1 0 . |
| 3 2009-06-26 1700 0 195.2 1 . 3 1 1 . |
| 3 2009-06-27 1700 0 306.2 2 . 4 1 2 . |
|----------------------------------------------------------------------------------------|
| 3 2009-06-24 1710 0 150.1 null . 1 0 . . |
| 3 2009-06-25 1710 0 163.7 null . 2 0 . . |
| 3 2009-06-26 1710 0 288.2 null . 3 0 . . |
| 3 2009-06-27 1710 0 311.7 null . 4 0 . . |
|----------------------------------------------------------------------------------------|
| 3 2009-06-24 1720 0 135.1 -2 . 1 0 -2 . |
| 3 2009-06-25 1720 0 186 -1 . 2 0 -1 . |
| 3 2009-06-26 1720 1 297.2 0 . 3 1 0 . |
| 3 2009-06-27 1720 0 375.9 1 . 4 1 1 . |
+----------------------------------------------------------------------------------------+
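The first step requested in the question, dropping every id time group in which no call was ever made, is not shown as a separate step in the code above. Here is a minimal sketch on a small subset of the example data. The name ncalls is introduced here for illustration only.

```stata
* Sketch only: drop id time groups that never had a call
clear
input byte id str10 day int time byte called
1 "2009-06-24" 1700 0
1 "2009-06-25" 1700 0
1 "2009-06-24" 1710 1
1 "2009-06-25" 1710 0
2 "2009-06-24" 1700 0
2 "2009-06-25" 1700 1
end

* ncalls is a hypothetical name: total number of calls in each id time group
bysort id time: egen ncalls = total(called)

* keep only groups with a call on at least one day
drop if ncalls == 0

list, sepby(id time) noobs
```

Here the id 1, time 1700 group is dropped because it contains no call. With the variables created by the code above, drop if missing(wanted1) should have the same effect, since wanted1 is missing only in groups without a call.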