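I have been working on developing a week-by-week reference curve for placental thickness. To do that, I calculated the .03, .05, .10, .50, .90, .95 and .99 quantiles for each gestational week.
So I now have two datasets: one with placental thickness and one with the quantiles. I would like to create a new variable in the first dataset that flags outliers, using the lowest and highest quantiles.
Here is a sample of the data: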
Data A for thickness:
ID week day thickness
1 15 0 1.3
2 15 0 1.5
3 16 2 2.3
4 16 1 3.5
5 16 1 2.5
6 17 0 3.6
7 17 0 3.4
8 17 3 2.4
Data B for quantiles:
week .03 .05 .10 .50 .90 .95 .99
15 1.6 1.7 1.8 2.4 2.6 2.7 2.8
16 1.7 1.8 2.0 2.5 3.1 3.3 3.4
17 1.7 1.8 2.1 2.6 3.4 3.5 3.7
So I tried to write the code with ifelse() statements, as follows:
C<-within(A, {outlier = ifelse(A$Thickness<B[2] & A$week == B[1], 1, 0)
outlier = ifelse(A$Thickness>B[8] & A$week == B[1], 1, 0)})
However, this fails with an error because the two datasets do not have the same number of rows.
Error in `[<-.data.frame`(`*tmp*`, nl, value = list(outlier = c(0, 0, : replacement element 1 is a matrix/data frame of 33 rows, need 55808
The expected result, based on data A, looks like this:
Data C:
ID week day thickness outlier
1 15 0 1.3 1
2 15 0 1.5 1
3 16 2 2.3 0
4 16 1 3.5 1
5 16 1 2.5 0
6 17 0 3.6 0
7 17 0 3.4 0
8 17 3 2.4 0
Answer 0 (score: 2)
A solution using dplyr. We can join the two datasets on week and then evaluate the outlier condition.
library(dplyr)
B2 <- B %>% select(week, X.03, X.99)
A2 <- A %>%
left_join(B2, by = "week") %>%
mutate(outlier = as.integer(thickness < X.03 | thickness > X.99)) %>%
select(-starts_with("X"))
A2
# ID week day thickness outlier
# 1 1 15 0 1.3 1
# 2 2 15 0 1.5 1
# 3 3 16 2 2.3 0
# 4 4 16 1 3.5 1
# 5 5 16 1 2.5 0
# 6 6 17 0 3.6 0
# 7 7 17 0 3.4 0
# 8 8 17 3 2.4 0
Here is a base R version of the same operation.
B2 <- B[, c("week", "X.03", "X.99")]
A2 <- merge(A, B2, by = "week", all.x = TRUE)
A2$outlier <- as.integer(A2$thickness < A2$X.03 | A2$thickness > A2$X.99)
A2[, c("X.03", "X.99")] <- NULL
A2
# week ID day thickness outlier
# 1 15 1 0 1.3 1
# 2 15 2 0 1.5 1
# 3 16 3 2 2.3 0
# 4 16 4 1 3.5 1
# 5 16 5 1 2.5 0
# 6 17 6 0 3.6 0
# 7 17 7 0 3.4 0
# 8 17 8 3 2.4 0
Here is a data.table version of the same operation.
library(data.table)
setDT(A)
setDT(B)
B2 <- B[, .(week, X.03, X.99)]
setkey(A, week)
setkey(B2, week)
# between() is TRUE for values inside [X.03, X.99], so negate it to flag outliers
A2 <- merge(A, B2)[, outlier := as.integer(!between(thickness, X.03, X.99))
][, c("X.03", "X.99") := NULL]
A2[]
# week ID day thickness outlier
# 1: 15 1 0 1.3 1
# 2: 15 2 0 1.5 1
# 3: 16 3 2 2.3 0
# 4: 16 4 1 3.5 1
# 5: 16 5 1 2.5 0
# 6: 17 6 0 3.6 0
# 7: 17 7 0 3.4 0
# 8: 17 8 3 2.4 0
Data
A <- read.table(text = "ID week day thickness
1 15 0 1.3
2 15 0 1.5
3 16 2 2.3
4 16 1 3.5
5 16 1 2.5
6 17 0 3.6
7 17 0 3.4
8 17 3 2.4
",
header = TRUE)
B <- read.table(text = "week .03 .05 .10 .50 .90 .95 .99
15 1.6 1.7 1.8 2.4 2.6 2.7 2.8
16 1.7 1.8 2.0 2.5 3.1 3.3 3.4
17 1.7 1.8 2.1 2.6 3.4 3.5 3.7",
header = TRUE)
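A note on the column names used above: the X.03 and X.99 names come from read.table()'s default check.names = TRUE, which turns headers that are not syntactically valid R names into valid ones. The following minimal sketch uses a shortened, hypothetical two-quantile version of B (B_small and B_raw are illustrative names only) to show the difference.

```r
# Shortened, illustrative version of table B (only two quantile columns)
txt <- "week .03 .99
15 1.6 2.8
16 1.7 3.4"

# Default: check.names = TRUE converts ".03" into the valid name "X.03"
B_small <- read.table(text = txt, header = TRUE)
names(B_small)

# check.names = FALSE keeps the headers exactly as written
B_raw <- read.table(text = txt, header = TRUE, check.names = FALSE)
names(B_raw)
```

With check.names = FALSE the columns have to be referenced with backticks (for example B_raw$`.03`), which is why the prefixed names are often easier to work with.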
Answer 1 (score: 2)
A base R solution I could come up with:
transform(A,outlier=as.numeric((C<-thickness-B[as.factor(week),c(2,8)])[,1]<0|C[,2]>0))
ID week day thickness outlier
1 1 15 0 1.3 1
2 2 15 0 1.5 1
3 3 16 2 2.3 0
4 4 16 1 3.5 1
5 5 16 1 2.5 0
6 6 17 0 3.6 0
7 7 17 0 3.4 0
8 8 17 3 2.4 0
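You may prefer to write it in two steps, as follows:
C=A$thickness-B[as.factor(A$week),c(2,8)] # subtract columns 2 and 8 of B (the .03 and .99 quantiles) from the thickness
transform(A,outlier=as.numeric(C[,1]<0|C[,2]>0)) # outlier if below the .03 quantile (first column negative) or above the .99 quantile (second column positive)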
ID week day thickness outlier
1 1 15 0 1.3 1
2 2 15 0 1.5 1
3 3 16 2 2.3 0
4 4 16 1 3.5 1
5 5 16 1 2.5 0
6 6 17 0 3.6 0
7 7 17 0 3.4 0
8 8 17 3 2.4 0
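One caveat about the row lookup above: B[as.factor(week), ...] indexes B by the factor's integer codes, so it only lines up when B has exactly one row per week present in A, sorted by week. As a hedged alternative, here is a minimal sketch that looks rows up by value with match(). The data frames are rebuilt inline so the snippet is self-contained, and the X.03 and X.99 column names are assumed to be the ones read.table() produces by default.

```r
# Sample data, rebuilt inline (B reduced to the two quantiles that are used)
A <- data.frame(ID = 1:8,
                week = c(15, 15, 16, 16, 16, 17, 17, 17),
                day = c(0, 0, 2, 1, 1, 0, 0, 3),
                thickness = c(1.3, 1.5, 2.3, 3.5, 2.5, 3.6, 3.4, 2.4))
B <- data.frame(week = c(15, 16, 17),
                X.03 = c(1.6, 1.7, 1.7),
                X.99 = c(2.8, 3.4, 3.7))

# For each row of A, find the row of B with the same week (NA if there is none)
idx <- match(A$week, B$week)

# Flag values below the .03 quantile or above the .99 quantile
A$outlier <- as.integer(A$thickness < B$X.03[idx] | A$thickness > B$X.99[idx])
A
```

Weeks in A that have no row in B get NA for outlier instead of silently borrowing another week's limits.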
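Answer 2 (score: 1)
Here is an option using a data.table join. Note that it assumes B is a plain data.frame read with check.names = FALSE, so that its columns are literally named .03 and .99; with read.table()'s default settings they would be X.03 and X.99.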
library(data.table)
setDT(A)[B[c('week', '.03', '.99')], outlier :=
as.integer(thickness < `.03`| thickness > `.99`), on = .(week)]
A
# ID week day thickness outlier
#1: 1 15 0 1.3 1
#2: 2 15 0 1.5 1
#3: 3 16 2 2.3 0
#4: 4 16 1 3.5 1
#5: 5 16 1 2.5 0
#6: 6 17 0 3.6 0
#7: 7 17 0 3.4 0
#8: 8 17 3 2.4 0
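For comparison, here is a sketch of the same update join written against the default read.table() column names (X.03, X.99), with both tables converted to data.table first. The data is rebuilt inline, and the i. prefix is used to make explicit that the limits come from B; treat this as one possible way to write it rather than the answerer's exact method.

```r
library(data.table)

# Sample data, rebuilt inline (B reduced to the two quantiles that are used)
A <- data.table(ID = 1:8,
                week = c(15, 15, 16, 16, 16, 17, 17, 17),
                day = c(0, 0, 2, 1, 1, 0, 0, 3),
                thickness = c(1.3, 1.5, 2.3, 3.5, 2.5, 3.6, 3.4, 2.4))
B <- data.table(week = c(15, 16, 17),
                X.03 = c(1.6, 1.7, 1.7),
                X.99 = c(2.8, 3.4, 3.7))

# Update join: for rows of A matching B on week, add outlier by reference.
# i.X.03 and i.X.99 refer to the columns of B (the `i` table in the join).
A[B, on = "week",
  outlier := as.integer(thickness < i.X.03 | thickness > i.X.99)]
A[]
```

Because := modifies A by reference, no copy of A is made and its row order is preserved.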