R - Efficiently matching pairs in a matrix

Time: 2017-09-22 08:26:30

Tags: r loops matrix

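I have a combined packet capture taken with WireShark on both sides of a network connection. The capture is exported as a CSV file in which each row contains a unique ID and a timestamp. Because I capture on both sides, each ID appears in two rows: one with the send timestamp and one with the receive timestamp. I want to compute the delay by subtracting these values. I have managed to do this, but it takes about 12 seconds to go through my list of 17000 packets, and I have 15 lists in total, which adds up to about 3 minutes of execution time with the following code: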

data <- read.csv("normal-novpn.csv", sep=",", numerals="no.loss", header=TRUE)
ID = data.matrix(data[,7], rownames.force = NA)
time = data.matrix(as.double(as.character(data[,2])), rownames.force = NA)
time = time*1000000 # Time is now in microseconds

len <- nrow(ID)
mat <- matrix(,nrow=len,ncol=2)
for(i in 1:len){
    d <- unlist(strsplit(ID[i], " "))
    mat[i,1] <- as.numeric(gsub('[()]','',d[2]))
    mat[i,2] <- time[i]
}

delay = vector(length=len/2)
k <- 1
for(i in 1:len){
    for(j in i:len){
        if(mat[i,1] == mat[j,1] && mat[j,2] > mat[i,2]){
            delay[k] <- mat[j,2] - mat[i,2]
            k <- k+1
        }
    }
}

The rows in the CSV file are sorted by time, and a row looks like this:

"32","1505997726.015245358","10.0.10.70","10.0.10.1","UDP","214","0xa5f0 (42480)","50414  >  5201 Len=172"

Here the timestamp is "1505997726.015245358" and the ID is "0xa5f0 (42480)".

My question is whether I can do this more efficiently, to reduce the execution time.

Update: Here is a link to one of my CSV files, containing 17000 rows: https://justpaste.it/1bjoy

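Here is a small file with only 10 rows of data plus the header. One thing worth mentioning is that the two rows with the same ID are not always adjacent to each other in every file.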

"No.","Time","Source","Destination","Protocol","Length","Identification","Info"
"120","1505984967.366049706","10.0.0.50","10.0.0.35","UDP","214","0x8dab (36267)","46670  >  5201 Len=172"
"123","1505984967.366440","10.0.0.50","10.0.0.35","UDP","214","0x8dab (36267)","46670  >  5201 Len=172"
"124","1505984967.386478504","10.0.0.50","10.0.0.35","UDP","214","0x8dac (36268)","46670  >  5201 Len=172"
"125","1505984967.386606","10.0.0.50","10.0.0.35","UDP","214","0x8dac (36268)","46670  >  5201 Len=172"
"130","1505984967.406353133","10.0.0.50","10.0.0.35","UDP","214","0x8db0 (36272)","46670  >  5201 Len=172"
"131","1505984967.406555","10.0.0.50","10.0.0.35","UDP","214","0x8db0 (36272)","46670  >  5201 Len=172"
"132","1505984967.426372842","10.0.0.50","10.0.0.35","UDP","214","0x8db1 (36273)","46670  >  5201 Len=172"
"133","1505984967.426558","10.0.0.50","10.0.0.35","UDP","214","0x8db1 (36273)","46670  >  5201 Len=172"
"134","1505984967.446282356","10.0.0.50","10.0.0.35","UDP","214","0x8db6 (36278)","46670  >  5201 Len=172"
"135","1505984967.446555","10.0.0.50","10.0.0.35","UDP","214","0x8db6 (36278)","46670  >  5201 Len=172"

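Update 2: The order of the rows must be preserved, because I will run further calculations on the new values. The first column, "No.", is the packet number assigned by WireShark, and it must increase as the list is traversed.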

1 answer:

Answer 0: (score: 0)

Here is a fast solution using data.table. The file so_long.csv is the one from your edit.

library(data.table)
library(microbenchmark)

foo <- function() {
  dt <- fread("so_long.csv")
  dt[, Time := as.double(as.character(Time)) * 1000000]
  dt[, .(Delay = max(Time) - min(Time)), by = Identification]
}

head(foo())
# Identification   Delay
# 1:     0x0003 (3) 1749.75
# 2:     0x0004 (4) 1761.00
# 3:     0x0007 (7) 1887.50
# 4:     0x0009 (9) 1983.75
# 5:    0x000e (14) 1929.75
# 6:    0x0014 (20) 1948.50
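The delays printed above are all multiples of 0.25. That is expected: after multiplying by 1000000 the timestamps are around 1.5e15, and doubles of that magnitude are spaced 0.25 apart, so digits below about a quarter of a microsecond are lost. If finer resolution matters, one option is to keep the timestamp as text and handle the whole seconds and the fractional digits separately. The following is only a sketch under that assumption; `split_time` is a hypothetical helper, and it presumes the Time column was read as character (for example with `colClasses = "character"` in `read.csv`).

```r
# Hypothetical helper: split "seconds.fraction" strings into whole seconds
# and nanoseconds, so the large seconds part never shares a double with
# the small fractional part.
split_time <- function(x) {
  parts <- strsplit(x, ".", fixed = TRUE)
  sec  <- as.numeric(vapply(parts, `[`, character(1), 1))
  frac <- vapply(parts, function(p) if (length(p) > 1) p[2] else "0",
                 character(1))
  # Right-pad the fractional digits to nine places so they read as nanoseconds
  frac <- substr(paste0(frac, "000000000"), 1, 9)
  list(sec = sec, nsec = as.numeric(frac))
}

# Two timestamps of the same packet, taken from the sample rows in the question
t_send <- split_time("1505984967.366049706")
t_recv <- split_time("1505984967.366440")

# Delay in microseconds: the seconds cancel first, then the small parts are scaled
delay_us <- (t_recv$sec - t_send$sec) * 1e6 +
            (t_recv$nsec - t_send$nsec) / 1e3
delay_us
```

Because the subtraction happens on the small nanosecond values, the result keeps the precision that WireShark exported.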

microbenchmark(foo())
# Unit: milliseconds
# expr      min       lq     mean   median       uq      max neval
# foo() 38.28835 52.17356 64.48024 60.63322 72.21627 132.8679   100
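Since the question's second update asks for the row order to be kept, here is a small base R sketch of the same grouping idea that leaves the rows where they are. It is not part of the original answer. It assumes each ID occurs exactly twice, uses a trimmed copy of the sample rows (only three of the eight columns), and the names `csv_text`, `pkts`, `Time_us` and `delays` are made up for illustration.

```r
# Trimmed copy of the question's sample data (assumption: only the
# "No.", "Time" and "Identification" columns are needed here)
csv_text <- '"No.","Time","Identification"
"120","1505984967.366049706","0x8dab (36267)"
"123","1505984967.366440","0x8dab (36267)"
"124","1505984967.386478504","0x8dac (36268)"
"125","1505984967.386606","0x8dac (36268)"'

pkts <- read.csv(text = csv_text, stringsAsFactors = FALSE)
pkts$Time_us <- pkts$Time * 1e6   # microseconds, as in the question

# ave() applies FUN per ID and returns one value per row,
# in the original row order, so nothing is re-sorted
pkts$Delay <- ave(pkts$Time_us, pkts$Identification,
                  FUN = function(t) max(t) - min(t))

# One delay per pair, ordered by the first appearance of each ID
first_seen <- !duplicated(pkts$Identification)
delays <- pkts$Delay[first_seen]
```

Both `ave()` and `duplicated()` are vectorized, so this avoids the nested loop from the question while keeping the "No." column increasing.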