Identifying duplicate data using a threshold

Time: 2011-04-01 19:33:49

Tags: r

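I am working with Bluetooth sensor data and need to identify possible duplicate readings for each unique ID. The Bluetooth sensor scans every five seconds, so if a device is not moving quickly (i.e. sitting in traffic), the same device may be picked up again in subsequent readings. If the device makes a round trip, there may be several readings from the same device, but those should be minutes apart. I can't work out how to get rid of the duplicate data. I need to calculate a time difference column for rows whose macid matches the previous row.

The data has the format: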

          macid   time
00:03:7A:4D:F3:59  82333
00:03:7A:EF:58:6F 223556
00:03:7A:EF:58:6F 223601
00:03:7A:EF:58:6F 232731
00:03:7A:EF:58:6F 232736
00:05:4F:0B:45:F7 164141

I need to create:

            macid   time timediff
00:03:7A:4D:F3:59  82333 NA
00:03:7A:EF:58:6F 223556 NA
00:03:7A:EF:58:6F 223601 45
00:03:7A:EF:58:6F 232731 9130
00:03:7A:EF:58:6F 232736 5
00:05:4F:0B:45:F7 164141 NA

My first attempt at this is very slow and not really workable:

dedupeIDs <- function (zz) {
  # Order by macid and then time
  zz <- zz[order(zz$macid, zz$time), ]

  # Difference from the previous row; 999999 marks "no previous reading"
  zz$timediff <- c(999999, diff(zz$time))

  # Reset the difference wherever the macid changes from the previous row
  for (i in seq_len(nrow(zz))[-1]) {
    if (zz[i, "macid"] != zz[i - 1, "macid"]) {
      zz[i, "timediff"] <- 999999
    }
  }
  return(zz)
}

I will then be able to filter the data.frame based on the time difference column.

Sample data:
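
For illustration, the filtering step might look roughly like the sketch below. It assumes the timediff column already exists, with NA marking the first reading of each device; the values are typed in by hand from the desired table, and the `threshold` of 60 is a hypothetical cutoff, not a value given in the question.

```r
# Desired table, entered by hand so the sketch is self-contained.
zz <- data.frame(
  macid = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F", "00:03:7A:EF:58:6F",
            "00:03:7A:EF:58:6F", "00:03:7A:EF:58:6F", "00:05:4F:0B:45:F7"),
  time = c(82333, 223556, 223601, 232731, 232736, 164141),
  timediff = c(NA, NA, 45, 9130, 5, NA)
)

# Hypothetical threshold: gaps shorter than this count as duplicate readings.
threshold <- 60

# Keep the first reading of each device (NA) and any reading after a long gap.
deduped <- zz[is.na(zz$timediff) | zz$timediff >= threshold, ]
```

With this cutoff, four rows remain: the first reading of each device plus the 232731 reading that follows the long gap. The explicit `is.na()` test matters, because a bare `zz$timediff >= threshold` would yield NA for the first readings and produce all-NA rows in the result.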

structure(list(macid = structure(c(1L, 2L, 2L, 2L, 2L, 3L),
          .Label = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F", 
                     "00:05:4F:0B:45:F7"), class = "factor"), 
          time = c(82333, 223556, 223601, 232731, 232736, 164141)), 
          .Names = c("macid", "time"), row.names = c(NA, -6L), 
          class = "data.frame")

1 answer:

Answer 0 (score: 5)

How about:

x <- structure(list(macid= structure(c(1L, 2L, 2L, 2L, 2L, 3L),
 .Label = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F", "00:05:4F:0B:45:F7"),
 class = "factor"), time = c(82333, 223556, 223601, 232731, 232736, 164141)),
.Names = c("macid", "time"), row.names = c(NA, -6L), class = "data.frame")
# ensure 'x' is ordered properly
x <- x[order(x$macid, x$time), ]
# add timediff column by macid; NA marks the first reading of each device
x$timediff <- ave(x$time, x$macid, FUN = function(v) c(NA, diff(v)))
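
One caveat, offered as an assumption rather than something the question states: the time values look like HHMMSS clock readings (223556 as 22:35:56). If so, a plain numeric difference overstates gaps that cross a minute or hour boundary, for example 45 instead of 5 seconds between 223556 and 223601. A minimal sketch of converting to seconds first, using a hypothetical helper `hhmmss_to_seconds` and ignoring readings that cross midnight:

```r
# Hypothetical helper: convert an HHMMSS-encoded number to seconds since midnight.
# Assumes the encoding really is HHMMSS; the question does not confirm this.
hhmmss_to_seconds <- function(t) {
  hours   <- t %/% 10000
  minutes <- (t %/% 100) %% 100
  seconds <- t %% 100
  hours * 3600 + minutes * 60 + seconds
}

x <- data.frame(
  macid = c("00:03:7A:4D:F3:59", "00:03:7A:EF:58:6F", "00:03:7A:EF:58:6F",
            "00:03:7A:EF:58:6F", "00:03:7A:EF:58:6F", "00:05:4F:0B:45:F7"),
  time = c(82333, 223556, 223601, 232731, 232736, 164141)
)

# Same approach as above, applied to seconds instead of the raw numbers.
x <- x[order(x$macid, x$time), ]
x$secs <- hhmmss_to_seconds(x$time)
x$timediff <- ave(x$secs, x$macid, FUN = function(s) c(NA, diff(s)))
```

Under that assumption, timediff comes out as NA, NA, 5, 3090, 5, NA, so the two five-second rescans stand out clearly from the gap of about 51 minutes between trips.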