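I have a large data file (a .tr file). I have read the file into a data frame (df) and renamed the columns. I managed to loop over all the records and check certain conditions. Now I need to count how many unique values of the src.port column exist in the whole file. The following MWE illustrates my problem.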
# The df looks like:
st time from to protocol size flags flowID src.port dst.port seq pktID
+ 0.100000 1 2 tcp 40 ------- 1 5.0 2.1 0 0
- 0.100000 5 0 ack 40 ------- 1 5.1 2.3 0 0
r 0.102032 1 2 tcp 40 ------- 1 5.20 2.5 0 0
r 0.102032 1 2 tcp 40 ------- 1 5.11 2.6 0 0
r 0.102032 1 2 tcp 40 ------- 1 3.0 2.0 0 0
+ 0.121247 11 0 ack 40 ------- 1 11.1 2.10 0 1
r 0.132032 1 2 tcp 40 ------- 1 3.0 2.0 0 0
r 0.142065 1 2 tcp 40 ------- 1 3.0 4.0 0 0
# I have tried the following:
unique <- 0
for (i in 1:nrow(df)) {
  # feel free to suggest different way from the below line.
  # I think using the name of column would be better
  if (df[i, 1] == "r" && df[i, 3] == 1 && df[i, 4] == 2 && df[i, 5] == "tcp") {
    # now this condition is my question
    # check if df[i,9] is new not presented before...Note 5.0 is different from 5.1
    # check if df[i,10] is 2 and ignore any value after the dot (i.e 2.x ..X means any value)
    # so the condition would be:
    if ( df[i,9] is new not repeated && df[i,10] is 2.x)
      unique <- unique + 1
  }
}
Expected output from the sample data: unique = 3
Answer 0 (score: 0)
You can simply subset the relevant rows and use unique(). Here, I chain all the conditions together, extract only the "src.port" column, and call unique() on the result.
unique(mydf[mydf[, "st"] == "r" &
            mydf[, "from"] == 1 &
            mydf[, "to"] == 2 &
            mydf[, "protocol"] == "tcp" &
            # node part of dst.port must be exactly 2 ("^2.*" would also match 20.x)
            grepl("^2(\\.|$)", mydf[, "dst.port"]),
            "src.port"])
# [1] 5.20 5.11 3.00
Wrap that in length() to get the count you are looking for.
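As a minimal sketch of that count, the sample rows from the question can be loaded with read.table() and the filter wrapped in length(). The names keep and n_unique are illustrative and not part of the original answer, and the ports are left to R's default type guessing, so they are read as numbers here.

```r
# Sample rows copied from the question; column types are guessed by read.table().
mydf <- read.table(header = TRUE, text = "
st time from to protocol size flags flowID src.port dst.port seq pktID
+ 0.100000 1 2 tcp 40 ------- 1 5.0 2.1 0 0
- 0.100000 5 0 ack 40 ------- 1 5.1 2.3 0 0
r 0.102032 1 2 tcp 40 ------- 1 5.20 2.5 0 0
r 0.102032 1 2 tcp 40 ------- 1 5.11 2.6 0 0
r 0.102032 1 2 tcp 40 ------- 1 3.0 2.0 0 0
+ 0.121247 11 0 ack 40 ------- 1 11.1 2.10 0 1
r 0.132032 1 2 tcp 40 ------- 1 3.0 2.0 0 0
r 0.142065 1 2 tcp 40 ------- 1 3.0 4.0 0 0
")

# Logical vector marking the rows that satisfy every condition from the question.
# (keep is an illustrative name.)
keep <- mydf$st == "r" &
  mydf$from == 1 &
  mydf$to == 2 &
  mydf$protocol == "tcp" &
  grepl("^2(\\.|$)", mydf$dst.port)  # dst.port is 2 or 2.x, but not 20.x

# Number of distinct src.port values among the matching rows.
n_unique <- length(unique(mydf$src.port[keep]))
n_unique
# [1] 3
```

Building the logical vector once and reusing it keeps the conditions readable and avoids the row-by-row loop.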
Alternatively, create a subset of the data and count its rows after dropping duplicated src.port values.
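out <- mydf[mydf[, "st"] == "r" &
            mydf[, "from"] == 1 &
            mydf[, "to"] == 2 &
            mydf[, "protocol"] == "tcp" &
            # node part of dst.port must be exactly 2 ("^2.*" would also match 20.x)
            grepl("^2(\\.|$)", mydf[, "dst.port"]), ]
# keep the first row for each src.port, then count the rows
nrow(out[!duplicated(out$src.port), ])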
# [1] 3
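One caveat, sketched here under an assumption: if the port columns are really "node.port" addresses rather than decimal numbers, reading them as numeric merges values such as 5.2 and 5.20. The tiny table below is hypothetical and chosen only to show the effect. The colClasses argument of read.table() keeps the columns as text, and sub() strips everything from the first dot onward so the node part can be compared exactly.

```r
# Hypothetical values, not taken from the question's file.
txt <- "src.port dst.port
5.2 2.1
5.20 2.10
3.0 20.1"

as_num <- read.table(text = txt, header = TRUE)                            # ports guessed as numeric
as_chr <- read.table(text = txt, header = TRUE, colClasses = "character")  # ports kept as text

length(unique(as_num$src.port))  # 5.2 and 5.20 are the same number
# [1] 2
length(unique(as_chr$src.port))  # as text they stay distinct
# [1] 3

# Node part of dst.port: everything before the first dot.
node <- sub("\\..*$", "", as_chr$dst.port)
sum(node == "2")                       # exact match on the node part
# [1] 2
sum(grepl("^2.*", as_chr$dst.port))    # a bare prefix match also counts 20.1
# [1] 3
```

Whether this matters depends on the real trace file. For the sample rows in the question, both readings give the same count of 3.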