I have two data.tables with multiple columns as the key (the key consists of the columns record, dstPort, srcPort, proto, dstIP and srcIP).
Both have the same format.
dataset_1:
record dstPort srcPort proto dstIP srcIP state timestamp
1: state 80 32768 tcp 192.168.101.5 192.168.101.89 syn 1466580661185059
2: state 80 32768 tcp 192.168.101.5 192.168.101.89 syn_ack 1466520661604781
3: state 80 32768 tcp 192.168.101.5 192.168.101.89 close 1466532661885439
4: state 80 55555 tcp 192.168.101.5 192.168.101.89 syn 1466532661885440
and dataset_2:
record dstPort srcPort proto dstIP srcIP state timestamp
1: state 80 32768 tcp 192.168.101.5 192.168.101.89 established 1466537661727619
2: state 80 32768 tcp 192.168.101.5 192.168.101.89 close 1466532661986891
3: state 80 44444 tcp 192.168.101.5 192.168.101.89 established 1466537661727619
Here is what I want to do for each key in the datasets:
I want to find the records (rows) that have the same key and where a given state is available (i.e. state syn in dataset_1 and state established in dataset_2).
For these records, I want to subtract the timestamps from each other.
I.e.:
For each key in dataset_1, i.e.:
state 80 32768 tcp 192.168.101.5 192.168.101.89 with state syn
would give the timestamp 1466580661185059
and the same key in dataset_2:
state 80 32768 tcp 192.168.101.5 192.168.101.89 with state established
would give the timestamp 1466537661727619
After subtracting the timestamps: 1466580661185059 - 1466537661727619 = 42999457440
It is possible that dataset_2 has no record for a given key. That is why sorting does not work (and sorting is what all my attempts were based on). An example attempt (which no longer works once the tables are sorted):
dt_state1 <- subset(dt, state == 'established')
dt_state2 <- subset(dt, state == 'syn')
dt_delta_test <- data.table(x=(dt_state1$timestamp/1000)- (dt_state2$timestamp/1000),'timestamp'= dt_state1$timestamp-min(dt_state1$timestamp))
Update 1: @lmo:
F1_in = as.data.table(read.csv(file=Filename, header=TRUE, sep=","))
keys=c("record","dstPort","srcPort","dstIP","srcIP")
state1 = 'syn'
state2 = 'established'
dt_state1 <- subset(F1_in, state == state2)
setkey(dt_state1, keys)
Error in setkeyv(x, cols, verbose = verbose, physical = physical) : some columns are not in the data.table: keys
dt_state2 <- subset(F1_in, state == state1)
setkey(dt_state2, keys)
Error in setkeyv(x, cols, verbose = verbose, physical = physical) : some columns are not in the data.table: keys
dt_state1[dt_state2, timestamp - i.timestamp]
Error in `[.data.table`(dt_state1, dt_state2, timestamp - i.timestamp) :
When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted) so data.table knows which columns to join to and take advantage of x being sorted. Call setkey(x,...) first, see ?setkey.
I don't know why these errors occur.
@toni057: your solution did not change anything for me (I had to make some changes to the code I tried, because it raised some errors). I also changed the dt for the second filter, but dt_state1 does not change at all.
Answer 0 (score: 2)
If your goal is to get the time difference between two data.tables that share the same key, you can use a left join and then compute the difference:
# get stuff set up
library(data.table)
# convert data.frames to data.tables by reference
setDT(dt_state1)
setDT(dt_state2)
# set keys
setkey(dt_state1, record, dstPort, srcPort, proto, dstIP, srcIP)
setkey(dt_state2, record, dstPort, srcPort, proto, dstIP, srcIP)
# perform left join and get timestamp difference
dt_state1[dt_state2, timestamp - i.timestamp]
[1] 42999457440 -17000122838 -4999842180 47999198168 -12000382110 -101452 NA
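This performs a left join (subsetting the observations in dt_state1 to include only those in dt_state2) and subtracts the timestamps of dt_state2 from those of dt_state1.
The first entry of the returned vector is the value you listed in your example.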
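The join above matches on the key only, so every state in dt_state1 is paired with every state in dt_state2 that shares the key, which is why the first key yields six values. Below is a minimal sketch, assuming the intent is to compare only the syn rows of one table with the established rows of the other: filter each table first, then join with `on =` so that no `setkey()` call is needed. The names `key_cols`, `syn`, `est`, `res` and `delta` are illustrative and do not come from the question.

```r
library(data.table)

# Same sample rows as in the question, built directly as data.tables.
dt_state1 <- data.table(
  record = "state", dstPort = 80L,
  srcPort = c(32768L, 32768L, 32768L, 55555L),
  proto = "tcp", dstIP = "192.168.101.5", srcIP = "192.168.101.89",
  state = c("syn", "syn_ack", "close", "syn"),
  timestamp = c(1466580661185059, 1466520661604781,
                1466532661885439, 1466532661885440)
)
dt_state2 <- data.table(
  record = "state", dstPort = 80L,
  srcPort = c(32768L, 32768L, 44444L),
  proto = "tcp", dstIP = "192.168.101.5", srcIP = "192.168.101.89",
  state = c("established", "close", "established"),
  timestamp = c(1466537661727619, 1466532661986891, 1466537661727619)
)

key_cols <- c("record", "dstPort", "srcPort", "proto", "dstIP", "srcIP")

# Keep only the states of interest before joining.
syn <- dt_state1[state == "syn"]
est <- dt_state2[state == "established"]

# For each syn row (i), look up the established row (x) with the same key.
# Inside j, `timestamp` is the established timestamp and `i.timestamp`
# is the syn timestamp; keys without a match get NA in delta.
res <- est[syn, on = key_cols,
           .(srcPort, delta = i.timestamp - timestamp)]
print(res)
```

With this sample data, the srcPort 32768 key gives the difference from the question, and the srcPort 55555 key, which has no established record, gives NA. Adding `nomatch = NULL` to the join should drop such unmatched keys instead, if that is preferred.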
Data
dt_state1 <- read.table(header=T, text="
record dstPort srcPort proto dstIP srcIP state timestamp
1: state 80 32768 tcp 192.168.101.5 192.168.101.89 syn 1466580661185059
2: state 80 32768 tcp 192.168.101.5 192.168.101.89 syn_ack 1466520661604781
3: state 80 32768 tcp 192.168.101.5 192.168.101.89 close 1466532661885439
4: state 80 55555 tcp 192.168.101.5 192.168.101.89 syn 1466532661885440")
dt_state2 <- read.table(header=T, text="
record dstPort srcPort proto dstIP srcIP state timestamp
1: state 80 32768 tcp 192.168.101.5 192.168.101.89 established 1466537661727619
2: state 80 32768 tcp 192.168.101.5 192.168.101.89 close 1466532661986891
3: state 80 44444 tcp 192.168.101.5 192.168.101.89 established 1466537661727619")
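The error reported in the question's update, `some columns are not in the data.table: keys`, most likely arises because `setkey()` takes column names unquoted, so `setkey(dt_state1, keys)` looks for a column literally named `keys`. A minimal sketch of the usual fix is `setkeyv()`, which accepts a character vector. The small table `dt` here is made-up illustration data, not the asker's file, and the `keys` vector includes `proto`, which the asker's vector omits.

```r
library(data.table)

# Made-up two-row table, deliberately out of key order.
dt <- data.table(
  record = "state", dstPort = 80L, srcPort = c(55555L, 32768L),
  proto = "tcp", dstIP = "192.168.101.5", srcIP = "192.168.101.89",
  timestamp = c(2, 1)
)
keys <- c("record", "dstPort", "srcPort", "proto", "dstIP", "srcIP")

# setkey(dt, keys) would fail here: it treats `keys` as a column name.
setkeyv(dt, keys)   # character-vector variant of setkey()

print(key(dt))      # the six key columns
print(dt$srcPort)   # rows are now sorted by the key
```

Once both tables are keyed this way, the `dt_state1[dt_state2, ...]` join no longer raises the "x must be keyed" error, since that error is a consequence of the failed `setkey()` calls.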
Answer 1 (score: 0)
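library(dplyr)
# key columns shared by both tables (taken from the question)
keys <- c("record", "dstPort", "srcPort", "proto", "dstIP", "srcIP")
dt_state1 %>%
  filter(state == 'syn') %>%
  # keep every syn row; keys without an established row get NA
  left_join(filter(dt_state2, state == 'established'), by = keys) %>%
  mutate(timestamp_diff = timestamp.x - timestamp.y)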
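The same filter-then-left-join idea can also be sketched in base R with `merge()`, which needs no extra packages and uses the same `.x`/`.y` suffixes for the clashing `timestamp` columns. This is only an illustration under the assumption that the tables have already been filtered to the syn and established rows; the data frames `syn` and `est` below are hand-built from the question's sample rows.

```r
key_cols <- c("record", "dstPort", "srcPort", "proto", "dstIP", "srcIP")

# syn rows of dataset_1 and established rows of dataset_2 (sample data).
syn <- data.frame(
  record = "state", dstPort = 80L, srcPort = c(32768L, 55555L),
  proto = "tcp", dstIP = "192.168.101.5", srcIP = "192.168.101.89",
  timestamp = c(1466580661185059, 1466532661885440)
)
est <- data.frame(
  record = "state", dstPort = 80L, srcPort = c(32768L, 44444L),
  proto = "tcp", dstIP = "192.168.101.5", srcIP = "192.168.101.89",
  timestamp = c(1466537661727619, 1466537661727619)
)

# all.x = TRUE keeps every syn row (a left join); unmatched keys get NA.
joined <- merge(syn, est, by = key_cols, all.x = TRUE)
joined$timestamp_diff <- joined$timestamp.x - joined$timestamp.y
print(joined[, c("srcPort", "timestamp_diff")])
```

The timestamps here are stored as doubles, which represent integers of this size exactly, so the subtraction gives the exact difference for the matched key.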