Question

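I am working with the Geolife Trajectories 1.3 dataset (https://www.microsoft.com/en-us/download/confirmation.aspx?id=52367). It consists of a set of folders, one per user. Each user has several separate .plt files containing GPS coordinates and date-time information. Some users also have a labels file: time intervals together with the transportation type (plane, car, etc.) the user took during each interval.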

I created two datasets. The first contains all user IDs, date-times and other information that is not relevant here:

First dataset, with user IDs and dates:
ID      DATE
20      2007-04-29 08:34:32
...     ...
100     2007-04-29 12:35:04

The second contains all user IDs, start times, end times and transportation types:

   ID          Start.Time            End.Time Transportation
1: 21 2007/04/29 12:34:24 2007/04/29 12:53:45           taxi
2: 21 2007/04/29 22:27:11 2007/04/30 04:28:00           car
...

From the 'Start.Time' and 'End.Time' columns of the second dataset I created a vector of lubridate intervals:

2007-04-29 12:34:24 UTC--2007-04-29 12:53:45 UTC
...
2007-04-29 22:27:11 UTC--2007-04-30 04:28:00 UTC
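For reference, a minimal sketch of how such intervals can be built and queried with lubridate. The variable names `start`, `end` and `probe` are illustrative assumptions; only the timestamps come from the tables above, and UTC is assumed as the time zone.

```r
library(lubridate)

# Start and end times copied from the labels table above (UTC is an assumption)
start <- ymd_hms(c("2007/04/29 12:34:24", "2007/04/29 22:27:11"), tz = "UTC")
end   <- ymd_hms(c("2007/04/29 12:53:45", "2007/04/30 04:28:00"), tz = "UTC")

# One lubridate Interval per label row
intervals <- interval(start, end)

# Test a single timestamp against every interval;
# the result is one logical value per interval
probe <- ymd_hms("2007-04-29 12:40:00", tz = "UTC")
hits <- probe %within% intervals
```

Here `hits` can be used to index the labels table, which is the lookup that `get_label` below performs.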

Then I wrote two functions:

# function for single row label processing
#   will search row's DATE in a subset of intervals for current ID
#   if TRUE - will search for a label in a subset of labels for current ID
get_label <- function(id, date, labels_subset, interval_subset) {
    # convert date to POSIX time
    single_time <- as.POSIXct(date)
    # search for current time in intervals subset and get label
    result <- labels_subset[single_time %within% interval_subset]$Transportation
    # check for result, if there is none -> return NA
    if (identical(as.vector(result), character(0))) {
        # "is type 'character' but expecting type 'logical'. Column types must be
        # consistent for each group." will raise if `return(NA)` without `as.char`
        return(as.character(NA))
    } else {
        return(as.character(result))
    }
}

and

# function for ID subset label processing
#   will create a subset of intervals for current ID
#   will create a subset of labels for current ID   
get_group <- function(tab) {
    # grep ID
    id <- tab$ID[1]
    # create interval subset for ID
    interval_subset <- intervals[labels_d$ID == id]
    # create label subset for ID
    labels_subset <- labels_d[labels_d$ID == id]
    # pass all data for get_label function -- process `tab` by row
    tab[, get_label(as.integer(ID), as.character(DATE), labels_subset, interval_subset), 1:nrow(tab)]
}

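I want to get a vector with the label if DATE falls within one of the lubridate intervals, or NA if it does not fall within any interval for the current ID.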

And tmp <- get_group(dt[ID %in% c(21, 110)]) works:

> unique(tmp$V1)
[1] NA     "car"  "walk"

But tmp <- dt[, get_group(.SD), by = ID] does not work properly: it outputs only NA (and dt contains only two IDs, 21 and 110):

> unique(tmp$V1)
[1] NA

Even if I create a DT with only one ID, function(DT) works, while DT[, function(.SD), by = ID] does not:

tmp<- DT[ID==21]

unique(tmp[, get_group(.SD), by = ID]$V1)
>[1] NA

unique(get_group(tmp)$V1)
>[1] NA     "car"  "walk"

Why? What am I doing wrong?

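UPD: I should have printed .SD earlier. By default, data.table does not include the by= columns in .SD, so my function could not get the ID. Sadly, there is no warning about this. .SDcols did the trick: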

tmp[, get_group(.SD), by = ID, .SDcols=c('ID', 'DATE')]
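A minimal sketch of the behaviour described in the update, on a made-up toy table. The `toy` data and the `first_id` helper are hypothetical and only mimic the `id <- tab$ID[1]` line of `get_group`. Inside `by = ID`, `.SD` does not carry the `ID` column, so the lookup yields `NULL`; listing the column in `.SDcols` restores it, and the group value is also available through `.BY`.

```r
library(data.table)

# Hypothetical toy data standing in for the real `dt`
toy <- data.table(ID = c(21L, 21L, 110L), DATE = c("a", "b", "c"))

# Column names visible in .SD for each group (ID is not among them)
sd_cols <- toy[, .(cols = paste(names(.SD), collapse = ",")), by = ID]

# Hypothetical helper mimicking `id <- tab$ID[1]` in get_group
first_id <- function(tab) tab$ID[1]

# Default .SD: the ID column is missing, so the helper returns NULL
res_default <- toy[, .(found = !is.null(first_id(.SD))), by = ID]

# Listing ID in .SDcols makes the column visible to the helper again
res_sdcols <- toy[, .(found = !is.null(first_id(.SD))),
                  by = ID, .SDcols = c("ID", "DATE")]

# Alternative: read the current group's value from .BY
res_by <- toy[, .(id_seen = .BY$ID), by = ID]
```

With a `NULL` id, the comparison `labels_d$ID == id` has length zero, so both subsets in `get_group` come out empty and every row falls through to NA, which matches the output shown above.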

Answer 1

You can do a data.table non-equi join as follows:

ds2[ds1, on=.(ID, Start.Time <= DATE, End.Time >= DATE)]
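A small worked sketch of that join on made-up data. The contents of `ds1` and `ds2` below are hypothetical, and it is assumed that `DATE`, `Start.Time` and `End.Time` have all been converted to POSIXct in the same time zone beforehand.

```r
library(data.table)

# Hypothetical toy versions of the two datasets
ds1 <- data.table(
  ID   = c(21L, 21L, 110L),
  DATE = as.POSIXct(c("2007-04-29 12:40:00",
                      "2007-04-29 13:00:00",
                      "2007-04-29 12:40:00"), tz = "UTC")
)
ds2 <- data.table(
  ID             = 21L,
  Start.Time     = as.POSIXct("2007-04-29 12:34:24", tz = "UTC"),
  End.Time       = as.POSIXct("2007-04-29 12:53:45", tz = "UTC"),
  Transportation = "taxi"
)

# For each row of ds1, find the ds2 interval of the same ID that contains DATE;
# rows of ds1 without a matching interval get NA in Transportation
res <- ds2[ds1, on = .(ID, Start.Time <= DATE, End.Time >= DATE)]

# Variant: add the label to ds1 by reference, leaving DATE untouched
ds1[ds2, Transportation := i.Transportation,
    on = .(ID, DATE >= Start.Time, DATE <= End.Time)]
```

In the first form the `Start.Time` and `End.Time` columns of the result hold the `DATE` values from `ds1`, which is why the update-join variant may be easier to read when only the label is needed.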

`DT[, function(.SD), by = ID]` behaves differently from `function(DT[ID %in% ID_GROUP])`
