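I have the data below and the following problem: at some point in time and at some location, a contamination occurs. I do not know who caused it, but I would like to trace it back as well as I can. For every person I need the probability that they caused the contamination. That is what the desired column "Prob_Contaminator" should show.
The Note column tells me that a contamination occurred, but it only records the time at which the contamination was reported. My idea is that someone who was present close in time to the occurrence is likely to have caused it, and that individuals observed further away in time are correspondingly less likely to be the contaminator.
Importantly, individuals should be considered as possible contaminators only if they have the same location_id as the occurrence row. A further problem is that people who appear often in the data will automatically be flagged as contaminators more often. I also have data on when a cleaning took place. I am considering restricting the observations of these "frequent users" to the single observation closest to the event within one cleaning interval. How can I correctly identify contaminators without discriminating against people who happen to be "heavy users"?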
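The candidate selection described above could be sketched in base R roughly as follows. This is only a sketch under stated assumptions, not a definitive method: `find_candidates` and `closest_per_person` are hypothetical helper names, Event_ID is assumed to increase chronologically, and a cleaning row is assumed to open a new interval and to count as a candidate itself.

```r
# Data copied from the question; IDs are kept as character strings.
df <- data.frame(
  Event_ID    = 1:10,
  Person_ID   = c("1","1","2","3","3","3","4","5","6","7"),
  Note        = c("","","","Occured","","","","","","Ocurred"),
  time        = as.Date(c("1990-01-01","1990-01-02","1990-01-03","1993-01-03",
                          "1995-01-04","1995-01-04","1995-01-04",
                          "1995-01-05","1995-01-05","1995-01-05")),
  location_id = c("1","1","1","1","2","2","3","6","5","6"),
  Cleaning    = c("1","0","0","1","0","0","0","0","0","1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rows that could have caused the occurrence `event_id`.
# Assumption: Event_ID is in chronological order.
find_candidates <- function(df, event_id) {
  occ <- df[df$Event_ID == event_id, ]
  # Earlier events at the same location only
  prior <- df[df$location_id == occ$location_id & df$Event_ID < event_id, ]
  prior <- prior[order(prior$Event_ID), ]
  # Keep rows from the most recent earlier cleaning onward (inclusive), if any
  cleaned <- which(prior$Cleaning == "1")
  if (length(cleaned) > 0) prior <- prior[max(cleaned):nrow(prior), ]
  prior
}

# Hypothetical helper: keep, per person, only the event closest to the occurrence,
# so that frequent visitors count once per cleaning interval.
closest_per_person <- function(cand) {
  cand <- cand[order(as.numeric(cand$time), cand$Event_ID, decreasing = TRUE), ]
  cand[!duplicated(cand$Person_ID), ]
}

cand4  <- closest_per_person(find_candidates(df, 4))
cand10 <- closest_per_person(find_candidates(df, 10))
cand4[, c("Event_ID", "Person_ID", "time")]
```

For the occurrence at Event_ID = 4 this leaves one row for Person_ID 2 and one for Person_ID 1, so a person who visited several times in the interval enters the calculation only once.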
Data:
"Event_ID" "Person_ID" "Note" "time" "location_id" "Cleaning"
1 1 "" 1990-01-01 1 1
2 1 "" 1990-01-02 1 0
3 2 "" 1990-01-03 1 0
4 3 "Occured" 1993-01-03 1 1
5 3 "" 1995-01-04 2 0
6 3 "" 1995-01-04 2 0
7 4 "" 1995-01-04 3 0
8 5 "" 1995-01-05 6 0
9 6 "" 1995-01-05 5 0
10 7 "Ocurred" 1995-01-05 6 1
This is what I need (the Prob_Contaminator column is not complete):
"Event_ID" "Person_ID" "Note" "time" "location_id" "Cleaning" "Prob_Contaminator"
1 1 "" 1990-01-01 1 1 0.4
2 1 "" 1990-01-02 1 0 0.4
3 2 "" 1990-01-03 1 0 0.6
4 3 "Occured" 1993-01-03 1 1
5 3 "" 1995-01-04 2 0
6 3 "" 1995-01-04 2 0
7 4 "" 1995-01-04 3 0
8 5 "" 1995-01-05 6 0
9 6 "" 1995-01-05 5 0
10 7 "Ocurred" 1995-01-05 6 1
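The following example shows how I picture the Prob_Contaminator column being constructed. Consider row 4 (Event_ID = 4): a contamination has occurred. I now want to look back at all events since the last cleaning (here 3 events, because a cleaning took place at Event_ID = 1) and take into account how far each of them is from the contamination event. Only events with the same location_id as the occurrence row should be considered. Because the location_id is the same (= 1) for all rows in this example, each of these 3 events has a probability of 1/3 of being the contaminator. Multiple events by the same person should be reduced to the one closest in time to the contamination. This reduces the cases to two, giving Person_ID 1 and Person_ID 2 a probability of 1/2 each.
In addition, I want to weight each probability by its distance in time from the contamination. Because the "time" value of Person_ID = 2 is closer to the contamination row than the "time" value of Person_ID = 1, the Prob_Contaminator of Person_ID = 2 should be weighted higher. In this case I applied a weight of 1.2 to the more recent ID (1.2 * 0.5 = 0.6) and a weight of 0.8 to the other one (0.8 * 0.5 = 0.4).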
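One way to generalise the 1.2 / 0.8 weighting in this example is sketched below. It is an assumption, not something the question specifies: the hypothetical function `recency_probs` ranks the candidates by recency and spaces the multipliers evenly between 1 - spread and 1 + spread, so that they average to 1 and the probabilities still sum to 1. The `spread` parameter (0.2 here, to match 0.8 and 1.2) is likewise hypothetical.

```r
# Hypothetical helper: turn candidate times into recency-weighted probabilities.
# Assumption: multipliers are evenly spaced by recency rank, not by actual days.
recency_probs <- function(times, spread = 0.2) {
  n <- length(times)
  if (n == 0) return(numeric(0))
  if (n == 1) return(1)  # a single candidate gets the full probability
  # Rank 1 = oldest, rank n = most recent; tied times share an average rank
  r <- rank(as.numeric(times), ties.method = "average")
  # Multipliers run from 1 - spread (oldest) to 1 + spread (most recent)
  multiplier <- (1 - spread) + 2 * spread * (r - 1) / (n - 1)
  multiplier / n  # base probability 1/n times the multiplier
}

# The two candidates left for Event_ID = 4 after keeping one event per person
cand <- data.frame(
  Person_ID = c("1", "2"),
  time      = as.Date(c("1990-01-02", "1990-01-03")),
  stringsAsFactors = FALSE
)
cand$Prob_Contaminator <- recency_probs(cand$time)
cand
```

With these two candidates the sketch gives about 0.4 for Person_ID 1 and 0.6 for Person_ID 2, as in the example. A rank-based weight ignores how large the time gaps are; weighting by the actual number of days would be a different design choice.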
Code:
# Input data
df <- data.frame(Event_ID = c(1:10),
                 Person_ID = c("1","1","2","3","3","3","4","5","6","7"),
                 Note = c("","","","Occured","","","","","","Ocurred"),
                 time = as.Date(c('1990-1-1','1990-1-2','1990-1-3','1993-1-3','1995-1-4','1995-1-4','1995-1-4',"1995-1-5","1995-1-5","1995-1-5")),
                 location_id = c("1","1","1","1","2","2","3","6","5","6"),
                 Cleaning = c("1","0","0","1","0","0","0","0","0","1"))
# Desired output (Prob_Contaminator filled in only for the first occurrence)
df2 <- data.frame(Event_ID = c(1:10),
                  Person_ID = c("1","1","2","3","3","3","4","5","6","7"),
                  Note = c("","","","Occured","","","","","","Ocurred"),
                  time = as.Date(c('1990-1-1','1990-1-2','1990-1-3','1993-1-3','1995-1-4','1995-1-4','1995-1-4',"1995-1-5","1995-1-5","1995-1-5")),
                  location_id = c("1","1","1","1","2","2","3","6","5","6"),
                  Cleaning = c("1","0","0","1","0","0","0","0","0","1"),
                  Prob_Contaminator = c("0.4","0.4","0.6","","","","","","",""))