Tracing back to certain individuals

Time: 2017-11-02 11:17:27

Tags: r tidyverse

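I have the data below, and my problem is this: at some point in time and at some location, a contamination occurs. I do not know who caused it, but I would like to trace it back as far as possible. For each person, I need the probability that they caused the contamination. That is what the desired column "Prob_Contaminator" should show.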

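The Note column tells me when a contamination occurred, but that is only the time at which it was reported. My idea is that a person observed close in time to the report is more likely to have caused the contamination, and that individuals observed further back are less likely to be the contaminator.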

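Importantly, individuals are only considered possible contaminators if they have the same location_id as the row where the contamination occurred. Another problem is that people who appear frequently in the data would automatically be blamed for contamination more often. I also have data on when a cleaning took place. I am considering restricting the observations of these "heavy users" to the single observation closest to the event within one cleaning interval. How can I correctly identify contaminators without discriminating against people who happen to be "heavy users"?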
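The restriction described above could be sketched in base R roughly as follows. `candidates_for` is a hypothetical helper name, and the sketch assumes that ties in `time` are ordered by `Event_ID` and that the cleaning row itself counts as part of the interval it starts.

```r
# Sample data from the question (location_id and Cleaning kept as character).
df <- data.frame(
  Event_ID    = 1:10,
  Person_ID   = c("1","1","2","3","3","3","4","5","6","7"),
  Note        = c("","","","Occured","","","","","","Ocurred"),
  time        = as.Date(c("1990-01-01","1990-01-02","1990-01-03","1993-01-03",
                          "1995-01-04","1995-01-04","1995-01-04","1995-01-05",
                          "1995-01-05","1995-01-05")),
  location_id = c("1","1","1","1","2","2","3","6","5","6"),
  Cleaning    = c("1","0","0","1","0","0","0","0","0","1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: candidate rows for one contamination report.
candidates_for <- function(df, event_id) {
  ev <- df[df$Event_ID == event_id, ]
  # Rows at the same location that precede the report (by time, then Event_ID).
  prev <- df[df$location_id == ev$location_id &
               (df$time < ev$time |
                  (df$time == ev$time & df$Event_ID < ev$Event_ID)), ]
  prev <- prev[order(prev$time, prev$Event_ID), ]
  # Drop everything before the most recent cleaning; the cleaning row stays.
  cleanings <- which(prev$Cleaning == "1")
  if (length(cleanings) > 0) {
    prev <- prev[max(cleanings):nrow(prev), ]
  }
  # Keep only each person's observation closest to the report (their last row).
  prev[!duplicated(prev$Person_ID, fromLast = TRUE), ]
}

cand4  <- candidates_for(df, 4)
cand10 <- candidates_for(df, 10)
```

With the sample data, `cand4` keeps Event_ID 2 and 3 (one row each for Person_ID 1 and 2), and `cand10` keeps only Event_ID 8. Because every person contributes at most one row per report, a heavy user is not counted more often than anyone else.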

Data:

"Event_ID" "Person_ID" "Note"    "time"       "location_id" "Cleaning"
1           1           ""        1990-01-01   1             1
2           1           ""        1990-01-02   1             0
3           2           ""        1990-01-03   1             0
4           3           "Occured" 1993-01-03   1             1
5           3           ""        1995-01-04   2             0
6           3           ""        1995-01-04   2             0
7           4           ""        1995-01-04   3             0
8           5           ""        1995-01-05   6             0
9           6           ""        1995-01-05   5             0
10          7           "Ocurred" 1995-01-05   6             1

This is what I need (the Prob_Contaminator column is not filled in completely):

 "Event_ID" "Person_ID" "Note"    "time"       "location_id" "Cleaning" "Prob_Contaminator"
1           1           ""        1990-01-01    1             1          0.4
2           1           ""        1990-01-02    1             0          0.4
3           2           ""        1990-01-03    1             0          0.6
4           3           "Occured" 1993-01-03    1             1
5           3           ""        1995-01-04    2             0
6           3           ""        1995-01-04    2             0
7           4           ""        1995-01-04    3             0
8           5           ""        1995-01-05    6             0
9           6           ""        1995-01-05    5             0
10          7           "Ocurred" 1995-01-05    6             1

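The following example shows how I imagine the Prob_Contaminator column being constructed. If we look at row 4 (Event_ID = 4), we see that a contamination occurred. I now want to look back at all events since the last cleaning (in this case 3 events, based on the cleaning performed at Event_ID = 1) and take into account how far they are from the contamination event. Only events at the same location_id should be considered. Since the location_id is the same for all of them in this example (= 1), the probability of being the contaminator would be 1/3 for each of these 3 events. Multiple events by one person should be reduced to the one closest in time to the contamination. This reduces the cases to two and gives Person_ID 1 and Person_ID 2 a probability of 1/2 each. In addition, I want to weight each probability by its distance to the contamination. Since the "time" value of Person_ID = 2 is closer to the contamination row than the "time" value of Person_ID = 1, Prob_Contaminator for Person_ID = 2 should get the higher weight. In this case I applied a weight of 1.2 to the more recent ID (1.2 * 0.5 = 0.6) and a weight of 0.8 to the other one (0.8 * 0.5 = 0.4).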
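The weighting step could look something like the base R sketch below. The description above only fixes the weights 1.2 and 0.8 for two candidates; spreading the weights linearly over recency ranks for more than two candidates is an assumption made here for illustration, and `contaminator_prob` is a hypothetical helper name.

```r
# Hypothetical helper: one probability per candidate person.
# `times` holds one date per person, already reduced to their latest event.
# Assumption: weights run linearly from `low` (oldest) to `high` (most recent)
# by rank, so they average to 1 and the probabilities sum to 1.
contaminator_prob <- function(person_id, times, low = 0.8, high = 1.2) {
  n <- length(times)
  if (n == 1) return(setNames(1, person_id))
  # Average ranks: candidates with identical times receive identical weights.
  rk <- rank(as.numeric(times))
  w  <- low + (high - low) * (rk - 1) / (n - 1)
  # Base probability 1/n, scaled by the recency weight.
  setNames(w / n, person_id)
}

# The two remaining candidates for the contamination at Event_ID = 4.
p <- contaminator_prob(
  person_id = c("1", "2"),
  times     = as.Date(c("1990-01-02", "1990-01-03"))
)
print(p)
```

For the two candidates of Event_ID = 4 this gives 0.4 for Person_ID 1 and 0.6 for Person_ID 2, matching the hand-filled values. Note that this variant uses only the order of the events, not the actual number of days to the contamination.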

Code:

df <- data.frame(Event_ID = c(1:10),
                 Person_ID = c("1","1","2","3","3","3","4","5","6","7"),
                 Note = c("","","","Occured","","","","","","Ocurred"),
                 time = as.Date(c('1990-1-1','1990-1-2','1990-1-3','1993-1-3','1995-1-4','1995-1-4','1995-1-4',"1995-1-5","1995-1-5","1995-1-5")),
                 location_id = c("1","1","1","1","2","2","3","6","5","6"),
                 Cleaning  = c("1","0","0","1","0","0","0","0","0","1"))

df2 <- data.frame(Event_ID = c(1:10),
                 Person_ID = c("1","1","2","3","3","3","4","5","6","7"),
                 Note = c("","","","Occured","","","","","","Ocurred"),
                 time = as.Date(c('1990-1-1','1990-1-2','1990-1-3','1993-1-3','1995-1-4','1995-1-4','1995-1-4',"1995-1-5","1995-1-5","1995-1-5")),
                 location_id = c("1","1","1","1","2","2","3","6","5","6"),
                 Cleaning  = c("1","0","0","1","0","0","0","0","0","1"),
                 Prob_Contaminator  = c("0.4","0.4","0.6","","","","","","",""))

0 answers:

No answers