Using dplyr

Time: 2018-06-18 22:05:25

Tags: r dplyr

I have just started learning dplyr and am struggling to combine several functions.

I have the following dataset.

Time                    ID1   ID2   ID3
1 2018-01-01 00:00:00   74    77    60
2 2018-01-01 00:00:01   75    79    61
3 2018-01-01 00:00:02   75    79    61

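I am trying to first compute the change per second (using lag(.)) and append three new columns named IDX_lag to the initial dataset. Then I compute the moving average of the lagged columns with roll_mean(colData, 2, align = "left", fill = 0) and append three new columns named IDX_mov_avg. Next, for each ID I create a binary column named IDX_bin that is 1 where IDX_mov_avg > 10. Finally, and this is probably the most complicated part, for every 1 in the binary column I compute a new column named IDX_alt that holds the mean of the original ID values immediately before and after the 1 (and the original value everywhere else). For example: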

Time                    ID1   [other columns]   ID1_bin   ID1_alt
1 2018-01-01 00:00:00   74     ...              0         74
2 2018-01-01 00:00:01   100    ...              1         74.5
3 2018-01-01 00:00:02   75     ...              0         75
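
The first three steps described above (per-second change, moving average, binary flag) could probably be sketched in a single mutate(across(...)) call. This is only a sketch under assumptions: the toy values are made up, the helper names per_sec_change and mov_avg2 are hypothetical, and mov_avg2 is a dplyr-only stand-in for roll_mean(x, 2, align = "left", fill = 0) so that the example needs no extra package.

```r
library(dplyr)

# Toy data in the shape of the question; the values are made up for illustration.
df <- tibble(
  Time = as.POSIXct("2018-01-01 00:00:00", tz = "UTC") + 0:5,
  ID1  = c(74, 75, 75, 120, 76, 76),
  ID2  = c(77, 79, 79, 80, 77, 76)
)

# Hypothetical helper: change from the previous row (first row is NA).
per_sec_change <- function(x) x - lag(x)

# Hypothetical stand-in for roll_mean(x, 2, align = "left", fill = 0):
# mean of each value and its successor, with 0 in the last position.
mov_avg2 <- function(x) {
  out <- (x + lead(x)) / 2
  out[length(out)] <- 0
  out
}

# One across() call with a named list of functions creates
# ID1_lag, ID1_mov_avg, ID1_bin, ID2_lag, ... next to the original columns.
result <- df %>%
  mutate(across(
    starts_with("ID"),
    list(
      lag     = per_sec_change,
      mov_avg = ~ mov_avg2(per_sec_change(.x)),
      bin     = ~ as.numeric(mov_avg2(per_sec_change(.x)) > 10)
    ),
    .names = "{.col}_{.fn}"
  ))

print(result)
```

Because the new columns are appended to the same data frame, no cbind or merge step is needed. Note that the first row of each IDX_bin column stays NA, since the first difference is NA.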

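Note that a run of 1s in a bin column may last longer than one second. In that case I want IDX_alt to be the mean of the value just before the first 1 and the value just after the last 1, so that all 1s in the same run get the same value.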
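
The run-based replacement could be sketched with base R's rle(), wrapped in a function that is then called from mutate(). The helper name fill_runs and the toy values are hypothetical, and the handling of edge cases (a run touching the first or last row uses only the one neighbour that exists, NA flags count as 0) is an assumption, not something the question specifies.

```r
library(dplyr)

# Hypothetical helper: replace every run of flagged values by the mean of the
# value just before the run and the value just after it.
fill_runs <- function(x, flag) {
  flag   <- !is.na(flag) & flag == 1     # assumption: NA flags mean "not flagged"
  runs   <- rle(flag)
  ends   <- cumsum(runs$lengths)         # last index of each run
  starts <- ends - runs$lengths + 1      # first index of each run
  out    <- x
  for (k in which(runs$values)) {        # loop over the flagged runs only
    before <- if (starts[k] > 1) x[starts[k] - 1] else NA
    after  <- if (ends[k] < length(x)) x[ends[k] + 1] else NA
    # assumption: at the edges, fall back to the single available neighbour
    out[starts[k]:ends[k]] <- mean(c(before, after), na.rm = TRUE)
  }
  out
}

# Toy data: rows 2 and 3 form one flagged run.
df <- tibble(
  ID1     = c(74, 100, 120, 76, 75),
  ID1_bin = c(0, 1, 1, 0, 0)
)

alt <- df %>% mutate(ID1_alt = fill_runs(ID1, ID1_bin))
print(alt)
```

Both flagged rows receive the same value, the mean of 74 and 76, while unflagged rows keep their original value.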

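I have already written this with base functions, but the code is far too long, and I have a feeling that dplyr could reduce the whole thing to a few lines. Please correct me if I am wrong.

Edit

Here is the code I currently have and am trying to change: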

                     Zeit U5NS A10UT I05ES E13OV
    1 2018-01-01 00:00:00   74    77    60   100
    2 2018-01-01 00:00:01   75    79    61    98
    3 2018-01-01 00:00:02   75    79    61    95
    4 2018-01-01 00:00:03   75    80    61    96
    5 2018-01-01 00:00:04   75    77    60    97
    6 2018-01-01 00:00:05   75    76    60    97

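library(dplyr)     # lag()
library(RcppRoll)  # roll_mean()

# assumes temp holds the data frame shown above; keep the timestamp column
time = temp[1]

# change from the previous row (the first row is NA)
lag_fun = function(colData)
  colData - lag(colData)

lag_data = cbind(time, setNames(lapply(temp[2:ncol(temp)], lag_fun),
                                paste0(names(temp)[2:ncol(temp)], "_diff")))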

head(lag_data)

                 Zeit U5NS_diff A10UT_diff I05ES_diff E13OV_diff
1 2018-01-01 00:00:00        NA         NA         NA         NA
2 2018-01-01 00:00:01         1          2          1         -2
3 2018-01-01 00:00:02         0          0          0         -3
4 2018-01-01 00:00:03         0          1          0          1
5 2018-01-01 00:00:04         0         -3         -1          1
6 2018-01-01 00:00:05         0         -1          0          0

# roll_mean() is from RcppRoll: window of 2, left-aligned, last position filled with 0
moving_average_fun = function(colData)
  roll_mean(colData, 2, align = "left", fill = 0)

moving_average_data = cbind(time, setNames(
  lapply(lag_data[2:ncol(lag_data)], moving_average_fun),
  paste0(names(temp)[2:ncol(temp)], "_mov_avg")
))

head(moving_average_data)

                 Zeit U5NS_mov_avg A10UT_mov_avg I05ES_mov_avg E13OV_mov_avg
1 2018-01-01 00:00:00           NA            NA            NA            NA
2 2018-01-01 00:00:01          0.5           1.0           0.5          -2.5
3 2018-01-01 00:00:02          0.0           0.5           0.0          -1.0
4 2018-01-01 00:00:03          0.0          -1.0          -0.5           1.0
5 2018-01-01 00:00:04          0.0          -2.0          -0.5           0.5
6 2018-01-01 00:00:05          0.0           0.5           0.0           0.0

# flag rows whose absolute moving average exceeds 10
artefact_fun = function(colData)
  as.numeric(abs(colData) > 10)

artefact_data = cbind(time, setNames(
  lapply(moving_average_data[2:ncol(moving_average_data)], artefact_fun),
  paste0(names(temp)[2:ncol(temp)], "_artefact")
))

head(artefact_data)

                 Zeit U5NS_artefact A10UT_artefact I05ES_artefact E13OV_artefact
1 2018-01-01 00:00:00            NA             NA             NA             NA
2 2018-01-01 00:00:01             0              0              0              0
3 2018-01-01 00:00:02             0              0              0              0
4 2018-01-01 00:00:03             0              0              0              0
5 2018-01-01 00:00:04             0              0              0              0
6 2018-01-01 00:00:05             0              0              0              0

pre_final = merge(moving_average_data, artefact_data, by = "Zeit")
final = merge(temp, pre_final, by = "Zeit")

no_col = ncol(final)
no_elem = ncol(temp) - 1
location = no_col - no_elem + 1

j = 2  # index of the original column paired with artefact column i

for (i in location:ncol(final)) {
  ind <- which(final[[i]] == 1)
  # only touch the column if it contains artefacts
  if (length(ind) != 0) {
    # replace each flagged value by the mean of its two neighbours
    final[ind, j] <- sapply(ind, function(m)
      mean(c(final[m - 1, j], final[m + 1, j])))
  }
  j = j + 1  # advance exactly once per artefact column
}

output = cbind(final,setNames(final[2:ncol(temp)], paste0(names(final)[2:ncol(temp)], "_altered")))
output = output[,-c(2:ncol(temp))]
output = merge(temp, output, by = "Zeit")

head(output)

                 Zeit U5NS A10UT I05ES E13OV U5NS_mov_avg A10UT_mov_avg I05ES_mov_avg E13OV_mov_avg
1 2018-01-01 00:00:00   74    77    60   100           NA            NA            NA            NA
2 2018-01-01 00:00:01   75    79    61    98          0.5           1.0           0.5          -2.5
3 2018-01-01 00:00:02   75    79    61    95          0.0           0.5           0.0          -1.0
4 2018-01-01 00:00:03   75    80    61    96          0.0          -1.0          -0.5           1.0
5 2018-01-01 00:00:04   75    77    60    97          0.0          -2.0          -0.5           0.5
6 2018-01-01 00:00:05   75    76    60    97          0.0           0.5           0.0           0.0
  U5NS_artefact A10UT_artefact I05ES_artefact E13OV_artefact U5NS_altered A10UT_altered I05ES_altered
1            NA             NA             NA             NA           74            77            60
2             0              0              0              0           75            79            61
3             0              0              0              0           75            79            61
4             0              0              0              0           75            80            61
5             0              0              0              0           75            77            60
6             0              0              0              0           75            76            60
  E13OV_altered
1           100
2            98
3            95
4            96
5            97
6            97

As you can see, this is a lot of code, and I think all of it could be simplified considerably.

0 answers:

No answers