I have just started learning dplyr and I am struggling to combine several of its functions.
I have the following dataset:
Time ID1 ID2 ID3
1 2018-01-01 00:00:00 74 77 60
2 2018-01-01 00:00:01 75 79 61
3 2018-01-01 00:00:02 75 79 61
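First, I want to compute the change per second (using lag(.)) and append 3 new columns named IDX_lag to the initial dataset. Next, I want to compute the moving average of the lag columns using roll_mean(colData, 2, align = "left", fill = 0) and append 3 new columns named IDX_mov_avg. Then, for each ID, I want to create a binary column named IDX_bin that flags IDX_mov_avg > 10. Finally, and this is probably the most complicated part, for each 1 in the binary column I want to compute a new column named IDX_alt that contains the mean of the original ID values just before and just after the 1. For example: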
Time ID1 [other columns] ID1_bin ID1_alt
1 2018-01-01 00:00:00 74 ... 0 74
2 2018-01-01 00:00:01 100 ... 1 74.5
3 2018-01-01 00:00:02 75 ... 0 75
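Note that some runs of 1s in the bin column may last longer than one second. In that case I want IDX_alt to be the mean of the value before the first 1 and the value after the last 1, so that all 1s in the same run share the same value.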
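The run rule described above can be sketched in base R with rle(). This is only a rough illustration: the helper name alt_from_runs is hypothetical, and the edge handling (a run that touches the first or last row falls back to the single available neighbour) is an assumption that the description above does not cover.

```r
# Hypothetical helper (not part of the posted code): for every run of 1s in
# `bin`, replace the values of `x` inside that run with the mean of the value
# just before the run and the value just after it.
alt_from_runs <- function(x, bin) {
  flagged <- !is.na(bin) & bin == 1          # treat NA flags as "not flagged"
  runs    <- rle(flagged)                    # consecutive runs of TRUE / FALSE
  ends    <- cumsum(runs$lengths)            # last row of each run
  starts  <- ends - runs$lengths + 1         # first row of each run
  out     <- as.numeric(x)
  for (k in which(runs$values)) {            # only the runs of 1s
    before <- if (starts[k] > 1) x[starts[k] - 1] else NA
    after  <- if (ends[k] < length(x)) x[ends[k] + 1] else NA
    # Assumption: at the edges, use whichever neighbour exists
    out[starts[k]:ends[k]] <- mean(c(before, after), na.rm = TRUE)
  }
  out
}

# The example table above: a single flagged row
alt_from_runs(c(74, 100, 75), c(0, 1, 0))
# 74.0 74.5 75.0

# A run of two flagged rows: both get the same value
alt_from_runs(c(74, 100, 120, 76), c(0, 1, 1, 0))
# 74 75 75 76
```

Because the helper takes two plain vectors, it could be called once per ID column, for instance inside mutate().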
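I have already written the code using base functions, but it is very long, and I have the feeling that the dplyr package could reduce the whole thing to a few lines. Please correct me if I am wrong.
Edit:
Here is the data, followed by the code I have and am trying to change: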
Zeit U5NS A10UT I05ES E13OV
1 2018-01-01 00:00:00 74 77 60 100
2 2018-01-01 00:00:01 75 79 61 98
3 2018-01-01 00:00:02 75 79 61 95
4 2018-01-01 00:00:03 75 80 61 96
5 2018-01-01 00:00:04 75 77 60 97
6 2018-01-01 00:00:05 75 76 60 97
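library(dplyr)     # lag()
library(RcppRoll)  # roll_mean()
# `temp` is the data frame shown above; its first column is the timestamp
time = temp[1]
# change per second: current value minus the previous value
lag_fun = function(colData) {
  colData - lag(colData)
}
lag_data = cbind(time, setNames(lapply(temp[2:ncol(temp)], lag_fun),
                                paste0(names(temp)[2:ncol(temp)], "_diff")))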
head(lag_data)
Zeit U5NS_diff A10UT_diff I05ES_diff E13OV_diff
1 2018-01-01 00:00:00 NA NA NA NA
2 2018-01-01 00:00:01 1 2 1 -2
3 2018-01-01 00:00:02 0 0 0 -3
4 2018-01-01 00:00:03 0 1 0 1
5 2018-01-01 00:00:04 0 -3 -1 1
6 2018-01-01 00:00:05 0 -1 0 0
# moving average of the changes over a window of 2 rows (left-aligned, gap filled with 0)
moving_average_fun = function(colData) {
  roll_mean(colData, 2, align = "left", fill = 0)
}
moving_average_data = cbind(time, setNames(
  lapply(lag_data[2:ncol(lag_data)], moving_average_fun),
  paste0(names(temp)[2:ncol(temp)], "_mov_avg")
))
head(moving_average_data)
Zeit U5NS_mov_avg A10UT_mov_avg I05ES_mov_avg E13OV_mov_avg
1 2018-01-01 00:00:00 NA NA NA NA
2 2018-01-01 00:00:01 0.5 1.0 0.5 -2.5
3 2018-01-01 00:00:02 0.0 0.5 0.0 -1.0
4 2018-01-01 00:00:03 0.0 -1.0 -0.5 1.0
5 2018-01-01 00:00:04 0.0 -2.0 -0.5 0.5
6 2018-01-01 00:00:05 0.0 0.5 0.0 0.0
# flag rows whose absolute moving average exceeds 10
artefact_fun = function(colData) {
  as.numeric(abs(colData) > 10)
}
artefact_data = cbind(time, setNames(
  lapply(moving_average_data[2:ncol(moving_average_data)], artefact_fun),
  paste0(names(temp)[2:ncol(temp)], "_artefact")
))
head(artefact_data)
Zeit U5NS_artefact A10UT_artefact I05ES_artefact E13OV_artefact
1 2018-01-01 00:00:00 NA NA NA NA
2 2018-01-01 00:00:01 0 0 0 0
3 2018-01-01 00:00:02 0 0 0 0
4 2018-01-01 00:00:03 0 0 0 0
5 2018-01-01 00:00:04 0 0 0 0
6 2018-01-01 00:00:05 0 0 0 0
pre_final = merge(moving_average_data, artefact_data, by = "Zeit")
final = merge(temp, pre_final, by = "Zeit")
no_col = ncol(final)
no_elem = ncol(temp) - 1
location = no_col - no_elem + 1
j = 2
# i walks over the *_artefact columns, j over the matching raw columns
for (i in location:ncol(final)) {
  ind <- which(final[[i]] == 1)
  # skip columns without artefacts
  if (length(ind) != 0) {
    # replace each flagged value with the mean of its two neighbours
    final[ind, j] <-
      sapply(ind, function(m)
        mean(c(final[m - 1, j], final[m + 1, j])))
  }
  # advance exactly once per artefact column so i and j stay aligned
  j = j + 1
}
output = cbind(final,setNames(final[2:ncol(temp)], paste0(names(final)[2:ncol(temp)], "_altered")))
output = output[,-c(2:ncol(temp))]
output = merge(temp, output, by = "Zeit")
head(output)
Zeit U5NS A10UT I05ES E13OV U5NS_mov_avg A10UT_mov_avg I05ES_mov_avg E13OV_mov_avg
1 2018-01-01 00:00:00 74 77 60 100 NA NA NA NA
2 2018-01-01 00:00:01 75 79 61 98 0.5 1.0 0.5 -2.5
3 2018-01-01 00:00:02 75 79 61 95 0.0 0.5 0.0 -1.0
4 2018-01-01 00:00:03 75 80 61 96 0.0 -1.0 -0.5 1.0
5 2018-01-01 00:00:04 75 77 60 97 0.0 -2.0 -0.5 0.5
6 2018-01-01 00:00:05 75 76 60 97 0.0 0.5 0.0 0.0
U5NS_artefact A10UT_artefact I05ES_artefact E13OV_artefact U5NS_altered A10UT_altered I05ES_altered
1 NA NA NA NA 74 77 60
2 0 0 0 0 75 79 61
3 0 0 0 0 75 79 61
4 0 0 0 0 75 80 61
5 0 0 0 0 75 77 60
6 0 0 0 0 75 76 60
E13OV_altered
1 100
2 98
3 95
4 96
5 97
6 97
As you can see, this is a lot of code, and I think all of it could be simplified considerably.
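As a rough sketch of the kind of simplification meant here, the first three steps (difference, moving average, artefact flag) might be written with dplyr's across(). The helper names change, mov_avg and artefact, and the variable signal_cols, are hypothetical. The sketch keeps the abs(...) > 10 threshold from the posted code and only rebuilds the six rows of data shown above.

```r
library(dplyr)     # mutate(), across(), all_of(), lag()
library(RcppRoll)  # roll_mean()

# The six rows of data shown above
temp <- data.frame(
  Zeit  = as.POSIXct("2018-01-01 00:00:00", tz = "UTC") + 0:5,
  U5NS  = c(74, 75, 75, 75, 75, 75),
  A10UT = c(77, 79, 79, 80, 77, 76),
  I05ES = c(60, 61, 61, 61, 60, 60),
  E13OV = c(100, 98, 95, 96, 97, 97)
)

# Hypothetical helpers; each mirrors one step of the posted code
change   <- function(x) x - lag(x)
mov_avg  <- function(x) roll_mean(change(x), 2, align = "left", fill = 0)
artefact <- function(x) as.numeric(abs(mov_avg(x)) > 10)

# Fix the set of raw columns up front so that across() never picks up
# the columns created earlier in the same mutate() call
signal_cols <- setdiff(names(temp), "Zeit")

result <- temp %>%
  mutate(
    across(all_of(signal_cols), change,   .names = "{.col}_diff"),
    across(all_of(signal_cols), mov_avg,  .names = "{.col}_mov_avg"),
    across(all_of(signal_cols), artefact, .names = "{.col}_artefact")
  )

head(result)
```

Each helper starts from the raw column, so the three across() calls do not depend on one another, and no cbind() or merge() is needed to bring the pieces back together.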