Let's say we have this data frame:
x<- as.data.frame(cbind(c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5)))
colnames(x)<- c("ID", "Visit", "Time", "State")
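The column ID denotes the subject ID.
The column Visit denotes a sequence of visits.
The column Time denotes the time at which a certain "state" was reached.
The column State denotes the severity of a disease, where 5 means death. This means you can fluctuate from a worse state to a better one, but you can never improve from category 5, because you are dead.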
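I only want to identify the subjects that improve from category 5 to a better state, since these are errors in the data frame (i.e. rows 13 and 16).
In addition, I would like to remove the rows in which a subject appears to die more than once (i.e. row 18).
I asked a similar question before, but it was very general and implied removing all fluctuations to a better state from the dataset, which is not really what I want.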
Answer 0 (score: 2)
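The OP has asked to identify errors in the data frame where, within an ID, a state 5 is followed by any state < 5. In the sample dataset, rows 13 and 16 should be flagged.
The answer of Hardik gupta points in the right direction but does not return the expected result: it flags rows 12 and 15 instead of rows 13 and 16. In addition, row 17 is a false alarm.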
Three changes are necessary: (1) use lag instead of lead (lag is the default type of shift()), and (2) supply a fill value to shift():
library(data.table)
setDT(x)[, error := State < 5 & shift(State, fill = 0) == 5, by = ID][]
    ID Visit Time State error
 1:  A     1 10.0     1 FALSE
 2:  A     2 12.5     3 FALSE
 3:  A     3 15.0     4 FALSE
 4:  B     1  2.0     1 FALSE
 5:  B     2  3.4     2 FALSE
 6:  B     3  5.7     3 FALSE
 7:  B     2  8.0     4 FALSE
 8:  B     3  9.5     3 FALSE
 9:  C     1  1.0     2 FALSE
10:  C     2  5.6     2 FALSE
11:  C     3  8.9     3 FALSE
12:  C     4 10.0     5 FALSE
13:  C     5 11.0     2  TRUE
14:  D     1  2.0     3 FALSE
15:  D     2  3.4     5 FALSE
16:  D     3  6.0     4  TRUE
17:  D     4  8.0     5 FALSE
18:  D     5 10.5     5 FALSE
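As a cross-check, the same lagged comparison can be sketched in base R. This is only an illustration and not part of the original answer: it assumes the numeric version of the sample data, and `prev` is a hypothetical helper name. `ave()` performs the shift within each ID, with 0 standing in for the missing previous state at the first visit.

```r
# Sample data with properly typed (numeric) columns
x <- data.frame(
  ID    = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time  = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))

# prev (hypothetical name): state at the previous row of the same ID,
# with 0 as the fill value for the first row of each ID
prev <- ave(x$State, x$ID, FUN = function(s) c(0, head(s, -1)))

# A row is an error if the subject was dead (5) and is now in a better state
x$error <- x$State < 5 & prev == 5

which(x$error)
```

With the sample data this should flag the same two rows as the data.table version.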
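The third change concerns how the sample dataset is created.
cbind()
returns a matrix, which coerces all columns to one common type, here character. as.data.frame() then turns these columns into factors (the default before R 4.0.0) or leaves them as character. Either way, the columns that should hold numbers are no longer numeric. To avoid this, the sample dataset should be defined as: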
x <- data.frame(
ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))
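The type problem can be checked directly. The following is a minimal sketch on a shortened, made-up version of the data (`x_cbind` and `x_df` are hypothetical names), comparing the two ways of building the data frame:

```r
# Built via cbind(): all values pass through a character matrix first
x_cbind <- as.data.frame(cbind(c("A", "B"), c(1, 2), c(10, 12.5)))

# Built directly with data.frame(): each column keeps its own type
x_df <- data.frame(ID = c("A", "B"), Visit = c(1, 2), Time = c(10, 12.5))

# No column of the cbind() version is numeric
# (factor before R 4.0.0, character from R 4.0.0 on)
sapply(x_cbind, is.numeric)

# In the data.frame() version, Visit and Time are numeric
sapply(x_df, is.numeric)
```

Running `str(x)` on the real data is a quick way to see the same thing before applying comparisons such as `State < 5`.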
Answer 1 (score: 2)
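The OP has substantially modified the question by requesting that all rows occurring after the first occurrence of state 5 (death) be considered erroneous. This includes false recoveries (as in rows 13 and 16) as well as repeated "deaths" (as in rows 17 and 18).
Answering this requires a completely different approach. One possibility is to use a non-equi join: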
library(data.table)
setDT(x)[x[, first(Visit[State == 5]), by = ID], on = .(ID, Visit > V1), error := TRUE][]
    ID Visit Time State error
 1:  A     1 10.0     1    NA
 2:  A     2 12.5     3    NA
 3:  A     3 15.0     4    NA
 4:  B     1  2.0     1    NA
 5:  B     2  3.4     2    NA
 6:  B     3  5.7     3    NA
 7:  B     2  8.0     4    NA
 8:  B     3  9.5     3    NA
 9:  C     1  1.0     2    NA
10:  C     2  5.6     2    NA
11:  C     3  8.9     3    NA
12:  C     4 10.0     5    NA
13:  C     5 11.0     2  TRUE
14:  D     1  2.0     3    NA
15:  D     2  3.4     5    NA
16:  D     3  6.0     4  TRUE
17:  D     4  8.0     5  TRUE
18:  D     5 10.5     5  TRUE
The visit number of the first occurrence of state 5 for each ID is returned by
x[, first(Visit[State == 5]), by = ID]
   ID V1
1:  C  4
2:  D  2
In the subsequent non-equi join, only rows that occur after the first state 5 event are flagged.
The sample data are defined as:
x <- data.frame(
ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))
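For comparison, a similar "everything after the first death" flag can be sketched in base R. This is an illustrative variant, not the non-equi join itself: it relies on row order within each ID rather than on `Visit > V1`, which gives the same result for the sample data because visits increase within IDs C and D. The name `cleaned` is hypothetical.

```r
# Sample data with properly typed (numeric) columns
x <- data.frame(
  ID    = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time  = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))

# Within each ID, flag every row that comes after a row in state 5:
# cumsum(d) > 0 is TRUE from the first death onwards; shifting it down by
# one row leaves the first death itself unflagged
x$error <- ave(x$State == 5, x$ID,
               FUN = function(d) c(FALSE, head(cumsum(d) > 0, -1)))

which(x$error)

# Dropping the flagged rows removes false recoveries and repeated deaths
cleaned <- x[!x$error, ]
```

Unlike the join, this version yields FALSE rather than NA for unflagged rows, which makes the column usable directly as a filter.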
Answer 2 (score: 1)
You can use data.table and shift like this:
library(data.table)
setDT(x)[, status := ((State == 5) & (shift(State,1,"lead") != 5)), by = ID]
x
ID Visit Time State status
1: A 1 10 1 FALSE
2: A 2 12.5 3 FALSE
3: A 3 15 4 FALSE
4: B 1 2 1 FALSE
5: B 2 3.4 2 FALSE
6: B 3 5.7 3 FALSE
7: B 2 8 4 FALSE
8: B 3 9.5 3 FALSE
9: C 1 1 2 FALSE
10: C 2 5.6 2 FALSE
11: C 3 8.9 3 FALSE
12: C 4 10 5 TRUE
13: C 5 11 2 FALSE
14: D 1 2 3 FALSE
15: D 2 3.4 5 TRUE
16: D 3 6 4 FALSE
17: D 4 8 5 TRUE
18: D 5 10.5 5 FALSE
Answer 3 (score: 1)
It is not clear to me what you are trying to do. Aren't rows 12, 15 and 17 the erroneous rows that should be removed?
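# `tmp` is not defined in the original answer; it is presumably a list of
# per-ID data frames (e.g. from split()) with a numeric State column.
do.call(rbind.data.frame, lapply(tmp, function(w) {
  # Flag state-5 rows followed by an equal or lower state; pad with FALSE
  # so the index matches nrow(w) and the last row is always kept.
  idx <- c(diff(w$State) <= 0 & w$State[-length(w$State)] == 5, FALSE)
  w[!idx, ]
}))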
# ID Visit Time State
#A.1 A 1 10 1
#A.2 A 2 12.5 3
#A.3 A 3 15 4
#B.4 B 1 2 1
#B.5 B 2 3.4 2
#B.7 B 2 8 4
#B.6 B 3 5.7 3
#B.8 B 3 9.5 3
#C.9 C 1 1 2
#C.10 C 2 5.6 2
#C.11 C 3 8.9 3
#C.13 C 5 11 2
#D.14 D 1 2 3
#D.16 D 3 6 4
#D.18 D 5 10.5 5