Suppose you have this data.frame:
d <- data.frame(ID=c(1,1,1,2,2,3,3,4,4,6,6,6,8,8),FP=-c(1,2,3,1,2,3,2,1,4,1,4,3,1,4)*100)
If you build the contingency matrix from it:
table(d$FP,d$ID)
       1 2 3 4 6 8
  -400 0 0 0 1 1 1
  -300 1 0 1 0 1 0
  -200 1 1 1 0 0 0
  -100 1 1 0 1 1 1
you can see that the set of columns with nonzero entries in row '-400' (c(4,6,8)) is a subset of the corresponding set for row '-100' (c(1,2,4,6,8)).
In such cases I need to remove the 'superset' row, i.e. the '-100' row here.
Similarly, you can see that the set of rows with nonzero entries in column '8' (c(-400,-100)) is a subset of the corresponding set for column '6' (c(-400,-300,-100)).
In such cases I need to remove the 'subset' column, i.e. the '8' column here, but subject to other considerations, which I will explain further on.
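Just to state these relations precisely, here is how they can be checked programmatically on the toy example (only for illustration; on my real data the table itself is the problem):

tab <- table(d$FP, d$ID)
# for each row, the column names with nonzero entries
cols <- apply(tab > 0, 1, function(r) colnames(tab)[r])
all(cols[["-400"]] %in% cols[["-100"]])  # TRUE: row '-400' is a subset of row '-100'
# same idea for columns
rows <- apply(tab > 0, 2, function(r) rownames(tab)[r])
all(rows[["8"]] %in% rows[["6"]])        # TRUE: column '8' is a subset of column '6'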
Crucially, I need to do this before creating the matrix, or rather without creating it at all, because with the kind of data.frames I am working with the matrix frequently causes memory problems.
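To give an idea of the blow-up (purely hypothetical figures): with 10^4 distinct FP values and 10^5 distinct ID values, a dense table would hold 10^4 × 10^5 = 10^9 cells, i.e. roughly 4 GB even at 4 bytes per integer cell, while the underlying data.frame may only have a few million rows.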
The way I currently do the superset-row removal directly on the original data is the following: aggregate the data by FP, collecting the sorted unique ID's (with simplify=FALSE to prevent simplification to a vector), then order by the size of the ID sets, i.e.:
t1 <- d
duprows <- aggregate(ID~FP,t1,function(x) {sort(unique(x))}, simplify=FALSE)
t1 <- t1[t1$FP %in% duprows[!(duplicated(duprows$ID)),][["FP"]],]
duprows["NIDs"] <- sapply(duprows[["ID"]],length)
duprows <- duprows[order(duprows[["NIDs"]]),]
# not run
#duprows
# FP ID NIDs
#1 -400 4, 6, 8 3
#2 -300 1, 3, 6 3
#3 -200 1, 2, 3 3
#4 -100 1, 2, 4, 6, 8 5
i = 1
Nconsts <- dim(duprows)[[1]]
while (i <= (Nconsts - 1)) {
  FPs_to_remove = numeric()
  current_NIDs = duprows[["NIDs"]][i]
  # find the index of the first NIDs greater than the current one
  greater_NIDs_index = which(duprows[["NIDs"]] > current_NIDs)[1]
  if (is.na(greater_NIDs_index)) {break} else {
    ID.i = unlist(duprows[["ID"]][i])
    for (j in greater_NIDs_index:Nconsts) {
      ID.j = unlist(duprows[["ID"]][j])
      matches = (ID.i %in% ID.j)
      # if every ID of set i occurs in set j, then j is a superset of i: mark its FP for removal
      if (all(matches)) {
        FPs_to_remove = c(FPs_to_remove, duprows[["FP"]][j])
      }
    }
    duprows = duprows[!(duprows[["FP"]] %in% FPs_to_remove),]
    Nconsts = dim(duprows)[[1]]
    i = i + 1
  }
}
t1 <- t1[t1$FP %in% duprows$FP,]
# not run
#duprows
# FP ID NIDs
#1 -400 4, 6, 8 3
#2 -300 1, 3, 6 3
#3 -200 1, 2, 3 3
In this example this method runs fast enough, but as you can imagine, for large datasets it is disastrous.
Question: can you suggest a more efficient way to achieve the same result?
I looked at the vignettes of the data.table package, which was suggested to me in another post for a different task, and in that case it did speed up the computation dramatically. [In fact, I can use the same package to remove the duplicated rows.]
However, for the task described above I cannot see a way to immediately exploit data.table's fast operations. Any suggestions are welcome.
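For what it's worth, this is the kind of duplicate-row removal I mean, in data.table form (a minimal sketch; it mirrors what I ended up doing in EDIT 2 below):

require(data.table)
dt <- data.table(d)
# one row per FP, with its sorted set of IDs as a list column
sets <- dt[, .(IDs = list(sort(ID))), by = FP]
# keep only the first FP of each distinct ID set
dt <- dt[FP %in% sets[!duplicated(IDs), FP]]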
As I mentioned, the other related task I need to run is the removal of subset columns, but additionally taking a numerical property P into account.
I adopted a strategy similar to the above: aggregate the data by ID, collecting the sorted unique FP's (with simplify=FALSE to prevent simplification to a vector), then order by the size of the FP sets, now also by decreasing P and then by ID to break ties in P; that way the rows are already ordered so that duplicated() keeps the entry with the highest P, i.e.:
P_vs_id <- data.frame(ID=c(1,2,3,4,6,8),P=c(0.5,0.8,0.1,0.6,0.9,0.75))
dupcols <- aggregate(FP~ID,t1,function(x) {sort(unique(x))}, simplify=FALSE)
dupcols <- merge(dupcols,P_vs_id,by="ID")
dupcols["NFPs"] <- sapply(dupcols[["FP"]],length)
dupcols <- dupcols[order(dupcols$NFPs,-dupcols$P,dupcols$ID),]
t1 <- t1[t1$ID %in% dupcols[!(duplicated(dupcols$FP)),][["ID"]],]
In this example, this removes the column with ID 4, which is identical to the column with ID 8 but has a lower P (the column with ID 3, a duplicate of the column with ID 1 with lower P, is removed as well, as the output below shows).
Then, for the third task, the subset-column removal:
dupcols <- aggregate(FP~ID,t1,function(x) {sort(unique(x))}, simplify=FALSE)
dupcols <- merge(dupcols,P_vs_id,by="ID")
dupcols["NFPs"] <- sapply(dupcols[["FP"]],length)
dupcols <- dupcols[order(dupcols$NFPs,-dupcols$P,dupcols$ID),]
# not run
#dupcols
# ID FP P NFPs
#2 2 -200 0.80 1
#4 8 -400 0.75 1
#3 6 -400, -300 0.90 2
#1 1 -300, -200 0.50 2
i = 1
NIDs <- dim(dupcols)[[1]]
while (i <= (NIDs - 1)) {
  current_NFPs = dupcols[["NFPs"]][i]
  # find the index of the first NFPs larger than the current one
  larger_NFPs_index = which(dupcols[["NFPs"]] > current_NFPs)[1]
  if (is.na(larger_NFPs_index)) {break} else {
    FP.i = unlist(dupcols[["FP"]][i])
    P.i = unlist(dupcols[["P"]][i])
    j = larger_NFPs_index
    flag = 0
    while (j <= NIDs) {
      FP.j = unlist(dupcols[["FP"]][j])
      P.j = unlist(dupcols[["P"]][j])
      matches = (FP.i %in% FP.j)
      # remove column i if its FP set is contained in a larger set with equal or higher P
      if (all(matches) & (P.i <= P.j)) {
        dupcols = dupcols[-i,]
        flag = 1
        break
      } else {j = j + 1}
    }
    NIDs = dim(dupcols)[[1]]
    if (flag == 0) {i = i + 1}
  }
}
# not run
#dupcols
# ID FP P NFPs
#2 2 -200 0.80 1
#3 6 -400, -300 0.90 2
#1 1 -300, -200 0.50 2
t1 <- t1[(t1$ID %in% dupcols$ID),]
In this example, this removed the column with ID 8, whose set of FP's is a subset of the set of FP's of ID 6 and which has a lower P. It did not remove the column with ID 2, although it is a subset of the column with ID 1, because the former has the higher P.
Again, this works fine for a small data.frame like this one, but with the kind of data I am working with it can take several hours.
I sped it up slightly by excluding from the checks those FP.i sets that can never be a subset of any other set, but that made only a small impact.
And consider that the superset and subset removals often need to be repeated a few times, because, as you can imagine, removing some rows or some columns can change the matrix and make a further run necessary.
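To sketch what I mean by repeating (remove_superset_rows and remove_subset_cols are hypothetical wrappers around the two loops above, not functions I have shown):

# repeat both passes until nothing is removed any more (fixed point)
repeat {
  n_before <- nrow(t1)
  t1 <- remove_superset_rows(t1)         # hypothetical wrapper around the first loop
  t1 <- remove_subset_cols(t1, P_vs_id)  # hypothetical wrapper around the second loop
  if (nrow(t1) == n_before) break
}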
So... any help towards better efficiency would really be significant.
Thanks!
EDIT
I found that xtabs can make a sparse contingency matrix; maybe that would avoid the memory errors I get with the full matrix. Then, since the tasks I described above seem closely related to the concept of linear dependence, perhaps I could use the Matrix package to run a QR decomposition and take it from there.
Does this seem a good way forward, or would QR be just as bad as my loops?
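For reference, this is the kind of construction I have in mind (just a sketch; I have not verified that the QR output maps usefully onto the subset/superset structure):

require(Matrix)
# sparse 0/1 contingency matrix instead of a dense table
m <- xtabs(~ FP + ID, data = d, sparse = TRUE)
qr_m <- qr(m)  # sparse QR decomposition from the Matrix package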
I had another (brief) go at data.table. Apart from speeding up the lookups in the original data.frame, I cannot tell whether the operations I described can be improved with it. I figured out how to write a function that tells whether a given row should be removed, but then I could not make it work within data.table; I am still not very familiar with the syntax.
编辑2
As far as I can tell, the matrix algebra approach is ultimately a dead end.
On the other hand, after studying data.table's syntax a little more, I put together a script that performs distinctly better than the one above (see below).
Still, I suspect I am not using data.table's potential to the full, so it would be great if any expert users would not mind looking at the script below and suggesting what could be done better.
d <- data.frame(ID=c(1,1,1,2,2,3,3,4,4,6,6,6,8,8),FP=-c(1,2,3,1,2,3,2,1,4,1,4,3,1,4)*100)
#
# SETUP
#
require(data.table)
# transfer our original data.frame data into a data.table
t1 <- data.table(d)
#
# REMOVE DUPLICATED ROWS
#
# count of how many ID's there are for each FP group
t1[,NIDs:=length(ID),by="FP"]
# sort
setorder(t1,NIDs,FP,ID)
# remove FP's that contain duplicate sets of ID's (duplicate rows)
duprows <- t1[,.(ID=list(c(ID))),by="FP"]
duprows[,is.dupl:=duplicated(ID)]
t1 <- t1[(FP %in% duprows[is.dupl == FALSE,FP])]
#
# REMOVE DUPLICATED COLUMNS
#
# count of how many FP's there are for each ID group
t1[,NFPs:=length(FP),by="ID"]
# sort
setorder(t1,-NFPs,ID,FP)
# remove ID's that contain duplicate sets of FP's (duplicate columns)
dupcols <- t1[,.(FP=list(c(FP))),by="ID"]
dupcols[,is.dupl:=duplicated(FP)]
t1 <- t1[(ID %in% dupcols[is.dupl == FALSE,ID])]
#
# REMOVE SUPERSET ROWS
#
# count of how many ID's there are for each FP group
t1[,NIDs:=length(ID),by="FP"]
# sort
setorder(t1,NIDs,FP,ID)
# for each FP group, write the first and last ID (they are already sorted)
t1[,ID.f:=first(c(ID)),by="FP"]
t1[,ID.l:=last(c(ID)),by="FP"]
# create an index for each FP group
t1[,FP.ind:=.GRP,by="FP"]
# initialise FP_to_remove column
t1[,FP_to_remove:=FALSE]
# mark FP groups that contain at least one ID corresponding to only one FP: the ID's in such FP groups can never be a subset of the ID's in another FP group
t1[,unique_ID:=any(NFPs==1),by="FP"]
# calculate the maximal FP group index
FP.ind.max <- t1[nrow(t1),FP.ind]
# for each FP group, check if its ID's are a subset of the ID's of other FP groups, and if so, mark the latter for removal
i = 1
while (i < FP.ind.max) {
  FP.i = t1[FP.ind == i, FP][1]
  ID.i = t1[FP == FP.i, c(ID), by = FP]$V1
  NIDs.i = t1[FP.ind == i, NIDs][1]
  ID.f.i = t1[FP.ind == i, ID.f][1]
  ID.l.i = t1[FP.ind == i, ID.l][1]
  if ((t1[FP.ind == i, unique_ID][1] == FALSE) & (t1[FP.ind == i, FP_to_remove][1] == FALSE)) {
    t1[(ID.f <= ID.f.i & ID.l >= ID.l.i & FP.ind > i & NIDs > NIDs.i & FP_to_remove == FALSE), FP_to_remove := all(ID.i %in% c(ID)), by = FP.ind]
  }
  i = i + 1
}
t1 <- t1[FP_to_remove == FALSE]
#
# REMOVE DUPLICATED COLUMNS (which may appear after removing superset rows)
#
# count of how many FP's there are for each ID group
t1[,NFPs:=length(FP),by="ID"]
# sort
setorder(t1,-NFPs,ID,FP)
# remove ID's that contain duplicate sets of FP's (duplicate columns)
dupcols <- t1[,.(FP=list(c(FP))),by="ID"]
dupcols[,is.dupl:=duplicated(FP)]
t1 <- t1[(ID %in% dupcols[is.dupl == FALSE,ID])]
#
# REMOVE SUBSET COLUMNS
#
# count of how many ID's there are for each FP group
t1[,NIDs:=length(ID),by="FP"]
# create an index for each ID group
t1[,ID.ind:=.GRP,by="ID"]
# for each ID group, write the first and last FP (they are already sorted)
t1[,FP.f:=first(c(FP)),by="ID"]
t1[,FP.l:=last(c(FP)),by="ID"]
# initialise ID_to_remove column
t1[,ID_to_remove:=FALSE]
# mark ID groups that contain at least one FP corresponding to only one ID: the FP's in such ID groups can never be a subset of the FP's in another ID group
t1[,unique_FP:=any(NIDs==1),by="ID"]
# calculate the maximal ID group index
ID.ind.max <- t1[nrow(t1),ID.ind]
# for each ID group, check if its FP's are a superset of the FP's of other ID groups, and if so, mark the latter for removal
i = 1
while (i < ID.ind.max) {
  ID.i = t1[ID.ind == i, ID][1]
  FP.i = t1[ID == ID.i, c(FP), by = ID]$V1
  NFPs.i = t1[ID.ind == i, NFPs][1]
  FP.f.i = t1[ID.ind == i, FP.f][1]
  FP.l.i = t1[ID.ind == i, FP.l][1]
  if (t1[ID.ind == i, ID_to_remove][1] == FALSE) {
    t1[(unique_FP == FALSE & FP.f >= FP.f.i & FP.l <= FP.l.i & ID.ind > i & NFPs < NFPs.i & ID_to_remove == FALSE), ID_to_remove := all(c(FP) %in% FP.i), by = ID.ind]
  }
  i = i + 1
}
t1 <- t1[ID_to_remove == FALSE]
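On the toy data the final result can be inspected directly (only feasible here because the matrix is small):

# contingency matrix of what is left after all removals
table(t1$FP, t1$ID)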