I have the following table (data frame Results):
FN DC ACC CC IND
20140926-1284 552 75.05% 2232 CMC1
20140926-1286 554 50.59% 2245 CMC1
20140926-1286 552 50.64% 2232 CMC1
20140926-1299 552 58.03% 2232 CMC1
20140926-1299 554 74.53% 2254 CMC1
20140926-1300 556 68.17% 2276 CMC1
20140926-1300 552 57.31% 2232 CMC1
20140926-1301 556 68.17% 2276 CMC1
20140926-1301 552 57.31% 2232 CMC1
20140926-1301 554 74.53% 2254 CMC1
20140926-1302 556 58.17% 2276 CMC1
20140926-1302 552 57.31% 2232 CMC1
20140926-1302 554 74.53% 2254 CMC1
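For the FN values that appear more than once (duplicates), I need to compare the values in the ACC column. If the difference between the ACC values is less than 10 percentage points, assign null to the row (e.g. 20140926-1286 -> 50.64 - 50.59 = 0.05). If the difference in ACC is greater than 10, keep the row with the maximum ACC. My expected output is therefore: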
FN DC ACC CC IND
20140926-1284 552 75.05% 2232 CMC1
20140926-1286 null null null null
20140926-1299 554 74.53% 2254 CMC1
20140926-1300 556 68.17% 2276 CMC1
20140926-1301 null null null null
20140926-1302 554 74.53% 2254 CMC1
Update
I split the unique and the duplicated records into separate data frames using:
mylist <- split(Results, duplicated(Results$FN) | duplicated(Results$FN, fromLast = TRUE))
names(mylist) <- c("nodupe", "dupe")
list2env(mylist ,.GlobalEnv)
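To illustrate what the split above produces, here is a minimal sketch on a small made-up data frame (toy and its values are hypothetical, not the original Results). It relies on split() ordering a logical grouping as "FALSE" then "TRUE", which is why the renaming to "nodupe" and "dupe" lines up. Note that the renaming would fail if the data contained only unique rows or only duplicated rows, because the list would then have a single element.

```r
# Toy data (hypothetical, not the original Results)
toy <- data.frame(
  FN  = c("a", "b", "b", "c"),
  ACC = c(75.05, 50.59, 50.64, 58.03),
  stringsAsFactors = FALSE
)

# TRUE for every row whose FN occurs more than once
# (the fromLast pass also flags the first occurrence of each duplicate)
is_dupe <- duplicated(toy$FN) | duplicated(toy$FN, fromLast = TRUE)

# split() names the pieces "FALSE" and "TRUE", in that order
parts <- split(toy, is_dupe)
names(parts) <- c("nodupe", "dupe")

parts$nodupe$FN  # "a" "c"
parts$dupe$FN    # "b" "b"
```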
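I am having trouble looping over the duplicated records. Here null means assigning a blank value when the ACC difference is less than 10. The structure of my input looks like this: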
str(Results)
'data.frame': 13 obs. of 5 variables:
$ FN : Factor w/ 5 levels "20140926-1284",..: 4 5 2 3 1 2 1
$ DC : int 556 552 552 552 552 554 554
$ ACC : Factor w/ 7 levels "57.86%","95.3%",..: 1 2 3 4 5 6 7
$ CC : int 2276 2232 2232 2232 2232 2245 2245
$ IND : Factor w/ 1 level "CMC1": 1 1 1 1 1 1 1
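Because str() reports ACC as a factor of strings such as "57.86%", the values have to be converted to numbers before any difference can be computed. A minimal sketch of that conversion, using a made-up factor (acc is hypothetical and only mimics the column above):

```r
# Hypothetical factor mimicking the ACC column shown by str()
acc <- factor(c("57.86%", "95.3%", "50.64%"))

# Pitfall: as.numeric() on a factor returns the internal level codes,
# not the numbers encoded in the labels
codes <- as.numeric(acc)

# Convert to character first, strip the percent sign, then parse
acc_num <- as.numeric(sub("%", "", as.character(acc), fixed = TRUE))
acc_num  # 57.86 95.30 50.64
```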
Answer 0 (score: 1)
Here is a possible data.table solution:
library(data.table)
setDT(Results)[, ACC := as.numeric(as.character(gsub("%", "", ACC)))] # Converting ACC to numeric
Results[, .SD[unique(ifelse(.N > 1 & any((ACC[ACC == max(ACC)] - ACC[ACC != max(ACC)]) < 10),
NA_integer_,
which.max(ACC)))], by = FN]
# FN DC ACC CC IND
# 1: 20140926-1284 552 75.05 2232 CMC1
# 2: 20140926-1286 NA NA NA NA
# 3: 20140926-1299 554 74.53 2254 CMC1
# 4: 20140926-1300 556 68.17 2276 CMC1
# 5: 20140926-1301 NA NA NA NA
# 6: 20140926-1302 554 74.53 2254 CMC1
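For readers without data.table, the same per-group rule can be sketched in base R. This is only an illustration under stated assumptions: pick_row is a hypothetical helper name, the data are the first five rows of the question with ACC already numeric, and FN is kept in the blanked rows to match the expected output. Unlike the data.table code above, this sketch also blanks a group whose maximum ACC is tied, since a gap of 0 is below 10.

```r
# First five rows from the question, with ACC already converted to numeric
Results <- data.frame(
  FN  = c("20140926-1284", "20140926-1286", "20140926-1286",
          "20140926-1299", "20140926-1299"),
  DC  = c(552L, 554L, 552L, 552L, 554L),
  ACC = c(75.05, 50.59, 50.64, 58.03, 74.53),
  CC  = c(2232L, 2245L, 2232L, 2232L, 2254L),
  IND = "CMC1",
  stringsAsFactors = FALSE
)

# pick_row (hypothetical helper): reduce one FN group to a single row
pick_row <- function(g) {
  top  <- which.max(g$ACC)
  gaps <- g$ACC[top] - g$ACC[-top]  # distance of every other row from the maximum
  if (nrow(g) > 1 && any(gaps < 10)) {
    # Some other row is within 10 of the maximum: blank everything except FN
    g[top, c("DC", "ACC", "CC", "IND")] <- NA
  }
  g[top, ]
}

# Apply the helper to each FN group and stack the results
out <- do.call(rbind, lapply(split(Results, Results$FN), pick_row))
rownames(out) <- NULL
out
```

The result keeps 20140926-1284 and the 74.53 row of 20140926-1299 unchanged, and returns NA in every column except FN for 20140926-1286.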