我有一个数据框格式的数据集,如下所示:
wpt ID Fuel Express Local
1 S36 12 0 1
2 S36 14 1 0
inter S36 NA 1 0
inter S36 NA 1 0
3 S36 16 1 0
inter S36 NA 0 1
4 S36 18 1 0
5 S36 22 1 0
6 W09 45 0 1
inter W09 NA 1 0
7 W09 48 0 1
我想处理子数据dat [c(2,inter,inter,3),](任何部分用" inter"结合常规编号wpt )作为一个单位。
(1)计算我的数据框中有多少个这样的子单元,在这种情况下它有两个(单元行,从2到3,以及3到4)
(2)然后计算从子单元的起始值到结束值 Express或Local值一致的单位数。在这种情况下,它有1个这样的单位一致(第2行,第3行,所有Express )和1个单位不同(第3行到第4行,以Express开头和结尾,从这些单位的起始值或结束值开始,就是本地。
(3)所有计算均按身份证明。
预期的输出是这样的:
ID consistent total
S36 1 2
W09 0 1
(4)如果我想在Fuel列中插入缺失值怎么办?做简单的线性插值。就像前两个NAs被14.66667和15.33333所取代,它们来自:
seq(14, 16, length.out=3)
这样的预期结果:
wpt ID Fuel Express Local
1 S36 12 0 1
2 S36 14 1 0
inter S36 14.66667 1 0
inter S36 15.33333 1 0
3 S36 16 1 0
inter S36 17 0 1
4 S36 18 1 0
5 S36 22 1 0
6 W09 45 0 1
inter W09 45.75 1 0
inter W09 46.50 1 0
inter W09 47.25 1 0
7 W09 48 0 1
提前致谢!
答案 0 :(得分:3)
subs <- with(rle(df$wpt),{
ends <- cumsum(lengths);
n <- grepl('^[0-9]+$',values);
w <- which(head(n,-2L) & values[-c(1L,length(n))]=='inter' & tail(n,-2L));
data.frame(start=c(0L,ends)[w]+1L,end=ends[w+2L]);
});
subs$ID <- df$ID[subs$start];
subs$consistent <- mapply(function(s,e,eq) all(eq[s:e]),subs$start,subs$end-1L,MoreArgs=list(diff(df$Express)==0L));
subs;
## start end ID consistent
## 1 2 5 S36 TRUE
## 2 5 7 S36 FALSE
## 3 9 11 W09 FALSE
res <- aggregate(cbind(consistent,total=rep(1L,length(ID)))~ID,subs,sum);
res;
## ID consistent total
## 1 S36 1 2
## 2 W09 0 1
数据强>
df <- data.frame(wpt=c('1','2','inter','inter','3','inter','4','5','6','inter','7'),ID=c(
'S36','S36','S36','S36','S36','S36','S36','S36','W09','W09','W09'),Fuel=c(12L,14L,NA,NA,16L,
NA,18L,22L,45L,NA,48L),Express=c(0L,1L,1L,1L,1L,0L,1L,1L,0L,1L,0L),Local=c(1L,0L,0L,0L,0L,1L,
0L,0L,1L,0L,1L),stringsAsFactors=F);
答案 1 :(得分:1)
如果我的意图是正确的,我对使用data.table
和rle
的案例(1)的建议是
library(data.table)
dt <- fread("
wpt ID Fuel Express Local
1 S36 12 0 1
2 S36 14 1 0
inter S36 NA 1 0
inter S36 NA 1 0
3 S36 16 1 0
inter S36 NA 0 1
4 S36 18 1 0
5 S36 22 1 0
6 W09 45 0 1
inter W09 NA 1 0
7 W09 48 0 1")
rdt <- dt[, rle(wpt), by = ID]
rdt[values == "inter" & lengths >= 1, .(total = .N), by = ID]
导致
ID total
1: S36 2
2: W09 1
答案 2 :(得分:1)
要计算模式,您可以使用rle
。例如:
x <- rle(df$wpt == 'inter')
y <- which(x$values)
cumsum(x$lengths)[y - 1L] + 1L # run starts
#[1] 3 6 10
x$lengths[y] # run lengths
#[1] 2 1 1
subunits <- lapply(y, function(i)
seq(cumsum(x$lengths)[i - 1L], length.out=x$lengths[i] + 2L))
现在subunits
是行索引列表
subunits
#[[1]]
#[1] 2 3 4 5
#
#[[2]]
#[1] 5 6 7
#
#[[3]]
#[1] 9 10 11
但要插入缺失值,这些都不是必需的,而是可以做
nas <- is.na(df$Fuel)
df$Fuel[nas] <- approx(seq(nrow(df)), df$Fuel, xout=which(nas))$y
df
# wpt ID Fuel Express Local
#1 1 S36 12.00000 0 1
#2 2 S36 14.00000 1 0
#3 inter S36 14.66667 1 0
#4 inter S36 15.33333 1 0
#5 3 S36 16.00000 1 0
#6 inter S36 17.00000 0 1
#7 4 S36 18.00000 1 0
#8 5 S36 22.00000 1 0
#9 6 W09 45.00000 0 1
#10 inter W09 46.50000 1 0
#11 7 W09 48.00000 0 1