How to remove subsets or supersets from data without building a matrix?

Asked: 2018-12-10 06:59:23

Tags: r dataframe data.table

Suppose you have this data.frame:

d <- data.frame(ID=c(1,1,1,2,2,3,3,4,4,6,6,6,8,8),FP=-c(1,2,3,1,2,3,2,1,4,1,4,3,1,4)*100)

If you make the contingency matrix of the two columns:

table(d$FP,d$ID)

       1 2 3 4 6 8
  -400 0 0 0 1 1 1
  -300 1 0 1 0 1 0
  -200 1 1 1 0 0 0
  -100 1 1 0 1 1 1

you can see that the set of column names of row '-400' (c(4,6,8)) is a subset of the set of column names of row '-100' (c(1,2,4,6,8)).
In that case, I need to remove the 'superset' row, i.e. the '-100' row here.
Likewise, you can see that the set of row names of column '8' (c(-400,-100)) is a subset of the set of row names of column '6' (c(-400,-300,-100)).
In that case, I need to remove the 'subset' column, i.e. the '8' column here, but subject to an additional consideration, which I will explain further down.

Crucially, I need to do this before/without creating the matrix, because for the kind of data.frames I am working with, the full matrix frequently causes memory problems.
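(For reference, a minimal sketch of how the same name sets can be read off the raw data without ever building the table; split() is just an illustration here, not part of my current approach:)

# ID sets per FP (the "column names" of each row in the table above)
split(d$ID, d$FP)

# FP sets per ID (the "row names" of each column in the table above)
split(d$FP, d$ID)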

The way I currently do the superset removal directly on the raw data is the following:

  1. aggregate the ID's by FP, using (sorted) unique as the function and simplify=FALSE to prevent the simplification to a vector; order the result by the length of the ID sets
  2. remove duplicated sets of ID's, because by definition any duplicate is also a superset
  3. loop over each ID set, checking it against all other ID sets of larger length, and if the latter are supersets, remove them

i.e.:

t1 <- d
duprows <- aggregate(ID~FP,t1,function(x) {sort(unique(x))}, simplify=FALSE)
t1 <- t1[t1$FP %in% duprows[!(duplicated(duprows$ID)),][["FP"]],]
duprows["NIDs"] <- sapply(duprows[["ID"]],length)
duprows <- duprows[order(duprows[["NIDs"]]),]

# not run
#duprows
#    FP            ID NIDs
#1 -400       4, 6, 8    3
#2 -300       1, 3, 6    3
#3 -200       1, 2, 3    3
#4 -100 1, 2, 4, 6, 8    5

i = 1
Nconsts <- dim(duprows)[[1]]

while (i <= (Nconsts-1)) {

    FPs_to_remove = numeric()
    current_NIDs = duprows[["NIDs"]][i]

    #find the index of the first NIDs greater than the current one
    greater_NIDs_index = which(duprows[["NIDs"]]>current_NIDs)[1]

    if (is.na(greater_NIDs_index)) {break} else {

        ID.i = unlist(duprows[["ID"]][i])

        for (j in greater_NIDs_index:Nconsts) {

            ID.j = unlist(duprows[["ID"]][j])

            matches = (ID.i %in% ID.j)

            if (all(matches)) {
                FPs_to_remove = c(FPs_to_remove, duprows[["FP"]][j]) }
        }
        duprows = duprows[!(duprows[["FP"]] %in% FPs_to_remove),]
        Nconsts = dim(duprows)[[1]]
        i = i + 1
    }
}

t1 <- t1[t1$FP %in% duprows$FP,]

# not run
#duprows
#    FP      ID NIDs
#1 -400 4, 6, 8    3
#2 -300 1, 3, 6    3
#3 -200 1, 2, 3    3

In this example the method runs fast enough, but as you can imagine it scales disastrously to large datasets.

Question: can you suggest a more efficient way to achieve the same result?

I had a look at the vignettes of the data.table package, which was suggested to me in another post for a different task, and there it did speed up the computation dramatically. [In fact, I can remove the duplicated rows using that same package.] For the task above, however, I cannot see an immediate way to exploit data.table's fast operations. Any suggestions in that direction would be most welcome.
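(For completeness, a minimal sketch of the duplicated-row removal in data.table, along the lines I have in mind; the list column plays the same role as the ID sets in the aggregate version above:)

require(data.table)
t2 <- data.table(d)

# collect the sorted set of ID's of each FP into a list column,
# then keep only the FP's whose ID set is not a duplicate
duprows2 <- t2[, .(ID = list(sort(unique(ID)))), by = FP]
t2 <- t2[FP %in% duprows2[!duplicated(ID), FP]]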

As I mentioned, another related task I need to run is the removal of the subset columns, but also taking into account a numerical property P.

I adopted a similar strategy to the above:

  1. aggregate the FP's by ID, using (sorted) unique as the function and simplify=FALSE to prevent the simplification to a vector; order by the length of the FP sets, and now also by decreasing P and by ID, in case of ties in P
  2. remove duplicated sets of FP's, because by definition any duplicate is also a subset; the largest P is kept, as the rows are already sorted
  3. loop over each FP set, checking it against all other FP sets of larger length, and if the former is a subset and has the smaller P, remove the former

i.e.:

P_vs_id <- data.frame(ID=c(1,2,3,4,6,8),P=c(0.5,0.8,0.1,0.6,0.9,0.75))
dupcols <- aggregate(FP~ID,t1,function(x) {sort(unique(x))}, simplify=FALSE)
dupcols <- merge(dupcols,P_vs_id,by="ID")
dupcols["NFPs"] <- sapply(dupcols[["FP"]],length)
dupcols <- dupcols[order(dupcols$NFPs,-dupcols$P,dupcols$ID),]
t1 <- t1[t1$ID %in% dupcols[!(duplicated(dupcols$FP)),][["ID"]],]

In this example, this removes the column with ID 4, which is identical to the column with ID 8 but has the lower P.

Then, for point 3:

dupcols <- aggregate(FP~ID,t1,function(x) {sort(unique(x))}, simplify=FALSE)
dupcols <- merge(dupcols,P_vs_id,by="ID")
dupcols["NFPs"] <- sapply(dupcols[["FP"]],length)
dupcols <- dupcols[order(dupcols$NFPs,-dupcols$P,dupcols$ID),]

# not run
#dupcols
#  ID         FP    P NFPs
#2  2       -200 0.80    1
#4  8       -400 0.75    1
#3  6 -400, -300 0.90    2
#1  1 -300, -200 0.50    2

i = 1
NIDs <- dim(dupcols)[[1]]

while (i <= (NIDs-1)) {

  current_NFPs = dupcols[["NFPs"]][i]

  #find the index of the first NFPs larger than the current one
  larger_NFPs_index = which(dupcols[["NFPs"]]>current_NFPs)[1]

  if (is.na(larger_NFPs_index)) {break} else {

    FP.i = unlist(dupcols[["FP"]][i])
    P.i = unlist(dupcols[["P"]][i])
    j = larger_NFPs_index
    flag = 0

    while (j <= NIDs) {

      FP.j = unlist(dupcols[["FP"]][j])
      P.j = unlist(dupcols[["P"]][j])

      matches = (FP.i %in% FP.j)

      if (all(matches) & (P.i <= P.j)) {
        dupcols = dupcols[-i,]
        flag = 1
        break} else {j = j + 1}
    }

    NIDs = dim(dupcols)[[1]]
    if (flag == 0) {i = i + 1}
  }
}

# not run
#dupcols
#  ID         FP    P NFPs
#2  2       -200 0.80    1
#3  6 -400, -300 0.90    2
#1  1 -300, -200 0.50    2

t1 <- t1[(t1$ID %in% dupcols$ID),]

In this example, this removes the column with ID 8, whose set of FP's is a subset of the FP's of ID 6 and which has the lower P. It does not remove the column with ID 2, although it is a subset of the column with ID 1, because the former has the higher P.

Again, this is fine for a small data.frame like this one, but for the kind of data I am dealing with it can take several hours.
I sped it up slightly by excluding from the check the FP.i sets that can never be a subset of any other set, but that had only a minor impact.
And bear in mind that it is often necessary to repeat the superset and subset removal several times, because, as you can imagine, removing some rows or columns can change the matrix in a way that makes further passes necessary.
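(Schematically, the outer loop looks like the sketch below, where remove_superset_rows and remove_subset_cols are hypothetical wrappers around the two procedures above, not functions I have actually factored out:)

# iterate the two passes until neither removes anything more
repeat {
  n_before <- nrow(t1)
  t1 <- remove_superset_rows(t1)         # hypothetical wrapper: superset-row pass
  t1 <- remove_subset_cols(t1, P_vs_id)  # hypothetical wrapper: subset-column pass
  if (nrow(t1) == n_before) break
}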

So... any help towards making this more efficient would really be much appreciated.

Thanks!


EDIT after further research

I found that xtabs can make sparse contingency matrices; maybe that would avoid the memory errors I get with the full matrix. And since the tasks I described above seem closely related to the concept of linear dependence, perhaps I could use the Matrix package to run a QR decomposition and take it from there.
Does that seem like a good way forward, or would the QR end up just as bad as my loops?
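(A minimal sketch of that idea on the toy data, assuming xtabs(..., sparse=TRUE) and Matrix's sparse QR behave as I expect; I have not verified that the rank information actually identifies the subset/superset structure:)

require(Matrix)

# sparse contingency matrix instead of a dense table()
M <- xtabs(~ FP + ID, data = d, sparse = TRUE)

# sparse QR; transposed if needed so there are at least as many rows as
# columns, which the sparse decomposition routines typically expect
qrM <- qr(if (nrow(M) >= ncol(M)) M else t(M))
qrM@R   # upper-triangular factor; tiny diagonal entries would hint at dependence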

I had another go (albeit a brief one) at data.table. Beyond speeding up the lookups into the original data.frame, I do not know whether the operations I described can be improved with it. I figured out how to write a function that tells whether a given row should be removed, but then I could not make it work inside data.table; I am still not too familiar with the syntax.
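(Roughly the shape of what I tried, as a sketch; is_strict_superset is a hypothetical helper of mine, only checked on the toy data:)

require(data.table)
t2 <- data.table(d)

# one sorted ID set per FP, as a plain named list
sets <- lapply(split(d$ID, d$FP), function(x) sort(unique(x)))

# TRUE if 'ids' strictly contains some other, smaller set
is_strict_superset <- function(ids, sets) {
  any(vapply(sets,
             function(s) length(s) < length(ids) && all(s %in% ids),
             logical(1)))
}

# mark the superset rows group by group, then drop them
t2[, FP_to_remove := is_strict_superset(unique(ID), sets), by = FP]
t2 <- t2[FP_to_remove == FALSE]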


编辑2

As far as I can tell, the matrix algebra approach is ultimately a dead end. On the other hand, after studying data.table's syntax a little, I put together a script that does significantly better than the one above (see below).

Still, I suspect I am not using data.table to its full potential, so if any expert user would not mind looking at the script below and suggesting what could be done better, that would be great.

d <- data.frame(ID=c(1,1,1,2,2,3,3,4,4,6,6,6,8,8),FP=-c(1,2,3,1,2,3,2,1,4,1,4,3,1,4)*100)

#
# SETUP
#

require(data.table)

# transfer our original data.frame into a data.table

t1 <- data.table(d)

#
# REMOVE DUPLICATED ROWS
#

# count of how many ID's there are for each FP group

t1[,NIDs:=length(ID),by="FP"]

# sort

setorder(t1,NIDs,FP,ID)

# remove FP's that contain duplicate sets of ID's (duplicate rows)

duprows <- t1[,.(ID=list(c(ID))),by="FP"]
duprows[,is.dupl:=duplicated(ID)]
t1 <- t1[(FP %in% duprows[is.dupl == FALSE,FP])]

#
# REMOVE DUPLICATED COLUMNS
#

# count of how many FP's there are for each ID group

t1[,NFPs:=length(FP),by="ID"]

# sort

setorder(t1,-NFPs,ID,FP)

# remove ID's that contain duplicate sets of FP's (duplicate columns)

dupcols <- t1[,.(FP=list(c(FP))),by="ID"]
dupcols[,is.dupl:=duplicated(FP)]
t1 <- t1[(ID %in% dupcols[is.dupl == FALSE,ID])]

#
# REMOVE SUPERSET ROWS
#

# count of how many ID's there are for each FP group

t1[,NIDs:=length(ID),by="FP"]

# sort

setorder(t1,NIDs,FP,ID)

# for each FP group, write the first and last ID (they are already sorted)

t1[,ID.f:=first(c(ID)),by="FP"]
t1[,ID.l:=last(c(ID)),by="FP"]

# create an index for each FP group

t1[,FP.ind:=.GRP,by="FP"]

# initialise FP_to_remove column

t1[,FP_to_remove:=FALSE]

# mark FP groups that contain at least one ID corresponding to only one FP: the ID's in such FP groups can never be a subset of the ID's in another FP group

t1[,unique_ID:=any(NFPs==1),by="FP"]

# calculate the maximal FP group index

FP.ind.max <- t1[nrow(t1),FP.ind]

# for each FP group, check if its ID's are a subset of the ID's of other FP groups, and if so, mark the latter for removal

i = 1
while (i < FP.ind.max) {
  FP.i = t1[FP.ind == i,FP][1]
  ID.i = t1[FP == FP.i,c(ID),by=FP]$V1
  NIDs.i = t1[FP.ind == i,NIDs][1]
  ID.f.i = t1[FP.ind == i,ID.f][1]
  ID.l.i = t1[FP.ind == i,ID.l][1]  
  if ((t1[FP.ind == i,unique_ID][1] == FALSE) & (t1[FP.ind == i,FP_to_remove][1] == FALSE)) {
    t1[(ID.f <= ID.f.i & ID.l >= ID.l.i & FP.ind > i & NIDs > NIDs.i & FP_to_remove == FALSE),FP_to_remove:=all(ID.i %in% c(ID)),by=FP.ind]
  }
  i = i + 1
}
t1 <- t1[FP_to_remove == FALSE]

#
# REMOVE DUPLICATED COLUMNS (which may appear after removing superset rows)
#

# count of how many FP's there are for each ID group

t1[,NFPs:=length(FP),by="ID"]

# sort
setorder(t1,-NFPs,ID,FP)

# remove ID's that contain duplicate sets of FP's (duplicate columns)

dupcols <- t1[,.(FP=list(c(FP))),by="ID"]
dupcols[,is.dupl:=duplicated(FP)]
t1 <- t1[(ID %in% dupcols[is.dupl == FALSE,ID])]

#
# REMOVE SUBSET COLUMNS
#

# count of how many ID's there are for each FP group

t1[,NIDs:=length(ID),by="FP"]

# create an index for each ID group

t1[,ID.ind:=.GRP,by="ID"]

# for each ID group, write the first and last FP (they are already sorted)

t1[,FP.f:=first(c(FP)),by="ID"]
t1[,FP.l:=last(c(FP)),by="ID"]

# initialise ID_to_remove column

t1[,ID_to_remove:=FALSE]

# mark ID groups that contain at least one FP corresponding to only one ID: the FP's in such ID groups can never be a subset of the FP's in another ID group

t1[,unique_FP:=any(NIDs==1),by="ID"]

# calculate the maximal ID group index

ID.ind.max <- t1[nrow(t1),ID.ind]

# for each ID group, check if its FP's are a superset of the FP's of other ID groups, and if so, mark the latter for removal

i = 1
while (i < ID.ind.max) {
  ID.i = t1[ID.ind == i,ID][1]
  FP.i = t1[ID == ID.i,c(FP),by=ID]$V1
  NFPs.i = t1[ID.ind == i,NFPs][1]
  FP.f.i = t1[ID.ind == i,FP.f][1]
  FP.l.i = t1[ID.ind == i,FP.l][1]  
  if (t1[ID.ind == i,ID_to_remove][1] == FALSE) {
    t1[(unique_FP == FALSE & FP.f >= FP.f.i & FP.l <= FP.l.i & ID.ind > i & NFPs < NFPs.i & ID_to_remove == FALSE),ID_to_remove:=all(c(FP) %in% FP.i),by=ID.ind]
  }
  i = i + 1
}
t1 <- t1[ID_to_remove == FALSE]
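(On this toy example, where memory is not a concern, the result can be sanity-checked by rebuilding the table; on my real data I obviously skip this step:)

# cross-tabulate the surviving data: no remaining row's ID set should be
# a superset of another's, and no remaining column's FP set a subset of another's
table(t1$FP, t1$ID)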

0 Answers