How to remove inconsistencies from a data frame (time series)

Asked: 2017-12-11 04:55:56

Tags: r dataframe time-series aggregate

Let's say we have this data frame:

x<- as.data.frame(cbind(c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
                        c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
                        c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
                        c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5)))
colnames(x)<- c("ID", "Visit", "Time", "State")

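ID denotes the subject ID.

Visit denotes the sequence of visits.

Time denotes the time elapsed until a given "state" was reached.

State denotes the severity of a disease, where 5 means death. This means a subject can fluctuate from a worse state to a better one, but can never improve from state 5, because the subject is already dead.

I only want to identify the subjects that improve from state 5 to a better state, since these are errors in the data frame (i.e. rows 13 and 16).

In addition, I would like to remove the rows where a subject appears to die more than once (i.e. row 18).

I asked a similar question before, but it was very general, and it implied removing all fluctuations to a better state from the data set, which is not what I really want.

4 answers: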

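Answer 0 (score: 2)

Answer to the original question

The OP asked to identify errors in the data frame where, within an ID, state 5 is followed by any state < 5. In the sample data set, rows 13 and 16 should be flagged.

The answer of Hardik gupta points in the right direction but does not return the expected result: it flags rows 12 and 15 instead of rows 13 and 16. In addition, row 17 raises a false alarm.

Three changes are necessary: (1) use lag instead of lead, and (2) supply a fill value to shift():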

library(data.table)
setDT(x)[, error := State < 5 & shift(State, fill = 0) == 5, by = ID][]
    ID Visit Time State error
 1:  A     1 10.0     1 FALSE
 2:  A     2 12.5     3 FALSE
 3:  A     3 15.0     4 FALSE
 4:  B     1  2.0     1 FALSE
 5:  B     2  3.4     2 FALSE
 6:  B     3  5.7     3 FALSE
 7:  B     2  8.0     4 FALSE
 8:  B     3  9.5     3 FALSE
 9:  C     1  1.0     2 FALSE
10:  C     2  5.6     2 FALSE
11:  C     3  8.9     3 FALSE
12:  C     4 10.0     5 FALSE
13:  C     5 11.0     2  TRUE
14:  D     1  2.0     3 FALSE
15:  D     2  3.4     5 FALSE
16:  D     3  6.0     4  TRUE
17:  D     4  8.0     5 FALSE
18:  D     5 10.5     5 FALSE
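To make the lag/lead distinction concrete, here is a minimal sketch on the five states of subject D alone. The vector names (`state`, `prev_state`, `next_state`, `recovered`) are illustrative and not part of the answer; `fill = 0` is the same placeholder used above for "no earlier visit".

```r
library(data.table)

# States of subject D from the sample data, in visit order
state <- c(3, 5, 4, 5, 5)

# Lag: every position sees the previous state. The first visit has no
# predecessor, so fill = 0 stands in for "no earlier state".
prev_state <- shift(state, n = 1L, fill = 0, type = "lag")

# Lead: every position sees the next state. The last visit has no
# successor and gets NA by default.
next_state <- shift(state, n = 1L, type = "lead")

# An erroneous recovery is a state below 5 directly after a state 5
recovered <- state < 5 & prev_state == 5

print(data.frame(state, prev_state, next_state, recovered))
```

Only the third visit (state 4 right after state 5) is flagged, which corresponds to row 16 of the full table. Without a `fill` value the first comparison would involve `NA` instead of a definite `FALSE`.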

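Data

The third change concerns how the sample data set is created.

cbind() returns a matrix, which forces all columns to a single type, character in this case; as.data.frame() then turns them into factors (or leaves them as character in R 4.0.0 and later). Consequently, none of the columns made up of numbers is numeric any more. To avoid this, the sample data set should be defined as: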

x <- data.frame(
  ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))
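The coercion can be checked on a two-row toy example. This is only a sketch; the names `m`, `df_bad`, and `df_good` are made up for illustration, and `V2` is the default name R gives to the second unnamed matrix column.

```r
# Mixing a character vector and a numeric vector in cbind() gives a
# character matrix, because a matrix can hold only one type.
m <- cbind(c("A", "B"), c(1, 2))
print(typeof(m))

# Converting that matrix keeps the loss of type: V2 is a factor
# (R < 4.0.0) or a character vector (R >= 4.0.0), never numeric.
df_bad <- as.data.frame(m)
print(is.numeric(df_bad$V2))

# Building the data frame column by column preserves each type.
df_good <- data.frame(ID = c("A", "B"), State = c(1, 2))
print(is.numeric(df_good$State))

# An already converted column can be repaired; going through
# as.character() works for both the factor and the character case.
df_bad$V2 <- as.numeric(as.character(df_bad$V2))
print(df_bad$V2)
```

Going through `as.character()` first matters in the factor case, since `as.numeric()` applied directly to a factor returns the internal level codes rather than the original numbers.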

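Answer 1 (score: 2)

Answer to the revised question

The OP has substantially revised the question and now asks that every row after the first occurrence of state 5 (death) be considered an error. This includes erroneous recoveries (as in rows 13 and 16) as well as repeated "deaths" (as in rows 17 and 18).

Answering this requires a completely different approach. One possibility is to use a non-equi join: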

library(data.table)
setDT(x)[x[, first(Visit[State == 5]), by = ID], on = .(ID, Visit > V1), error := TRUE][]
    ID Visit Time State error
 1:  A     1 10.0     1    NA
 2:  A     2 12.5     3    NA
 3:  A     3 15.0     4    NA
 4:  B     1  2.0     1    NA
 5:  B     2  3.4     2    NA
 6:  B     3  5.7     3    NA
 7:  B     2  8.0     4    NA
 8:  B     3  9.5     3    NA
 9:  C     1  1.0     2    NA
10:  C     2  5.6     2    NA
11:  C     3  8.9     3    NA
12:  C     4 10.0     5    NA
13:  C     5 11.0     2  TRUE
14:  D     1  2.0     3    NA
15:  D     2  3.4     5    NA
16:  D     3  6.0     4  TRUE
17:  D     4  8.0     5  TRUE
18:  D     5 10.5     5  TRUE

The visit number of the first occurrence of state 5 for each ID is returned by
x[, first(Visit[State == 5]), by = ID]
   ID V1
1:  C  4
2:  D  2

In the subsequent non-equi join, only the rows that come after the first state 5 event are flagged.
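As a cross-check, the same rows can be flagged without a join by counting the state 5 events seen before each row. This is an alternative sketch, not the method of this answer. It assumes the rows are already in visit order within each ID, and the column name `deaths_before` is hypothetical. Only subjects C and D are reproduced to keep it short.

```r
library(data.table)

# Subjects C and D from the sample data (Time omitted for brevity)
x <- data.table(
  ID    = c("C", "C", "C", "C", "C", "D", "D", "D", "D", "D"),
  Visit = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  State = c(2, 2, 3, 5, 2, 3, 5, 4, 5, 5))

# Running count of state-5 rows, lagged by one so that the first death
# itself is not counted against its own row
x[, deaths_before := shift(cumsum(State == 5), fill = 0L), by = ID]

# Any row with at least one earlier death is an error
x[, error := deaths_before > 0]

# Keep only the consistent rows
clean <- x[error == FALSE]
print(clean)
```

Here `error` is `TRUE` or `FALSE` for every row. With the join-based column shown above, unflagged rows hold `NA` instead, so dropping the erroneous rows there would use `x[is.na(error)]`.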

Data

x <- data.frame(
  ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))

Answer 2 (score: 1)

You can use shift() from data.table like this:

library(data.table)
setDT(x)[, status := ((State == 5) & (shift(State,1,"lead") != 5)), by = ID]
x
   ID Visit Time State status
1:  A     1   10     1  FALSE
2:  A     2 12.5     3  FALSE
3:  A     3   15     4  FALSE
4:  B     1    2     1  FALSE
5:  B     2  3.4     2  FALSE
6:  B     3  5.7     3  FALSE
7:  B     2    8     4  FALSE
8:  B     3  9.5     3  FALSE
9:  C     1    1     2  FALSE
10:  C     2  5.6     2  FALSE
11:  C     3  8.9     3  FALSE
12:  C     4   10     5   TRUE
13:  C     5   11     2  FALSE
14:  D     1    2     3  FALSE
15:  D     2  3.4     5   TRUE
16:  D     3    6     4  FALSE
17:  D     4    8     5   TRUE
18:  D     5 10.5     5  FALSE

Answer 3 (score: 1)

It is still not clear to me what you want to do. Aren't rows 12, 15, and 17 the erroneous rows that should be removed?

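# Assumption: the answer does not define `tmp`; judging by the row names in the
# output, x (with numeric State) was sorted by ID and Visit, then split by ID.
tmp <- split(x[order(x$ID, x$Visit), ], x$ID[order(x$ID, x$Visit)])
do.call(rbind.data.frame, lapply(tmp, function(w) {
    # TRUE for a state-5 row whose next state is not higher; padded to nrow(w)
    idx <- c(diff(w$State) <= 0 & w$State[-length(w$State)] == 5, FALSE)
    w[!idx, ]
}))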
#     ID Visit Time State
#A.1   A     1   10     1
#A.2   A     2 12.5     3
#A.3   A     3   15     4
#B.4   B     1    2     1
#B.5   B     2  3.4     2
#B.7   B     2    8     4
#B.6   B     3  5.7     3
#B.8   B     3  9.5     3
#C.9   C     1    1     2
#C.10  C     2  5.6     2
#C.11  C     3  8.9     3
#C.13  C     5   11     2
#D.14  D     1    2     3
#D.16  D     3    6     4
#D.18  D     5 10.5     5