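Remove duplicates based on multiple conditions

Time: 2017-11-13 15:22:18

Tags: r dplyr tidyr tidyverse

I have a dataset in which individuals (Name) appear multiple times within an Eggphase category. I want only one sample per individual, but I don't want to simply keep the first one R finds. I want to keep the one whose Group appears most often across all the other categories. I hope my example makes this clear.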

library(tidyverse)
myDF <- read.table(text="Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

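I want to keep one row per Name once the data are grouped by Tissue, Food and Eggphase, but I want to choose the row whose Group appears in most, if not all, of the different eggphases (for the same Tissue and Food combination).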

   #results I want
  Tissue Food Eggphase  Name Group
1     wb   fl    after   Kia     c
2     wb   wf   before   Kia     b
3     wb   fl   before  Lucy     c
4     wb   fl    after  Lucy     c
5     wb   fl  yolkdep  Jess     c
6     wb   fl  yolkdep Betty     b

I tried

one_bird <- myDF %>% 
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE)

but it only keeps the first entry

  Tissue Food Eggphase  Name Group
1     wb   fl    after   Kia     a
2     wb   wf   before   Kia     b
3     wb   fl   before  Lucy     c
4     wb   fl    after  Lucy     b
5     wb   fl  yolkdep  Jess     c
6     wb   fl  yolkdep Betty     b
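The behaviour above follows from how `distinct()` works: with `.keep_all = TRUE` it keeps the first row it meets for each key combination, so the row order decides which duplicate survives. A minimal sketch of that, using a made-up two-row data frame (`df` is only for illustration and is not the data from the question):

```r
library(dplyr)

# Two rows with the same Name/Eggphase key but different Group values
df <- data.frame(
  Name     = c("Kia", "Kia"),
  Eggphase = c("after", "after"),
  Group    = c("a", "c"),
  stringsAsFactors = FALSE
)

# distinct() keeps the first row per key: Group "a" survives here
first_kept <- df %>%
  distinct(Name, Eggphase, .keep_all = TRUE)

# Re-sorting before distinct() changes which row comes first,
# and therefore which one is kept: Group "c" survives here
reordered <- df %>%
  arrange(desc(Group)) %>%
  distinct(Name, Eggphase, .keep_all = TRUE)
```

So any rule for "which duplicate to keep" can be expressed as a sort that puts the preferred row first, followed by `distinct()`.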

How do I tell it to select the row whose Group appears in most (if not all) eggphases for a Tissue and Food combination? In my example, the Group that appears most often in the Tissue/Food combination wb fl is c, so for a duplicated Name c is the better choice. As in this example, my data also has duplicates whose rows come from groups that are not the most common one; how do I make it pick the row from the next most common group?
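To make the criterion concrete, it may help to count, for each Tissue/Food/Group combination, how many distinct eggphases the Group appears in and how many rows it has. The sketch below assumes dplyr 1.0 or later (for the `.groups` argument); the column names `n_phases` and `n_rows` are made up for illustration.

```r
library(dplyr)

myDF <- read.table(text = "Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE, stringsAsFactors = FALSE)

# For each Tissue/Food/Group: number of distinct eggphases the Group
# appears in (n_phases) and number of rows it has (n_rows)
group_summary <- myDF %>%
  group_by(Tissue, Food, Group) %>%
  summarise(
    n_phases = n_distinct(Eggphase),
    n_rows   = n(),
    .groups  = "drop"
  )

print(group_summary)
```

In the sample data, group c in wb fl appears in all three eggphases on four rows, while a and b each appear in two eggphases on two rows, which is why c is the preferred Group for the duplicated Kia and Lucy rows.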

I hope I have made myself clear enough.

3 answers:

Answer 0: (score: 2)

One option is to create a frequency column 'n' grouped by 'Tissue', 'Food' and 'Group', then sort in descending order of 'n' and keep a single row per Tissue, Food, Eggphase and Name combination.
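The code for this answer did not survive in the text, so the following is only a sketch of what such a pipeline could look like, not the answerer's original code. It assumes the frequency column is called `n` and that the deduplication is done with `distinct()`, as in the question.

```r
library(dplyr)

myDF <- read.table(text = "Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE, stringsAsFactors = FALSE)

one_bird <- myDF %>%
  # frequency of each Group within a Tissue/Food combination
  group_by(Tissue, Food, Group) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  # most frequent groups first, so distinct() meets them first
  arrange(desc(n)) %>%
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE) %>%
  select(-n)
```

With the sample data this keeps group c for the duplicated Kia and Lucy rows. Betty's groups a and b each occur twice in wb fl, so `n` alone does not separate them and the row order decides that case.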

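Answer 1: (score: 0)

I guess this post and its answers should give me a reason to learn dplyr and the tidyverse, but since I have already put in the effort to produce a working answer, here it is: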

myDF <- read.table(text="Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

# I usually have the following setting active: options(stringsAsFactors=F)
# The following might error without such a setting

# Create a var that indicates a duplicate or a record with a duplicate
myDF$duplicate <- duplicated(myDF[,c('Name','Eggphase','Tissue','Food')])
myDF$duplicate <- ifelse(duplicated(myDF[,c('Name','Eggphase','Tissue','Food')],fromLast=T),yes=T, no=myDF$duplicate)

# Count eggphases by group 
eggphaseCount <- with(myDF,aggregate(x=list(Group_phaseCt=Eggphase),by=list(Group=Group),FUN=function(x) length(unique(x))))
# Merge to DF
myDF <- merge(myDF,eggphaseCount,by='Group',all=T)

# Get the max number of eggphases by name
scale <- with(myDF,aggregate(x=list(PhaseMax=Group_phaseCt),by=list(Name=Name),FUN=max))
# Add to DF
myDF <- merge(myDF,scale,by='Name',all=T)

# Take the ratio
myDF$bestRatio <- with(myDF,Group_phaseCt/PhaseMax)
# Keep only those that aren't a duplicate, or are a duplicate and have the highest ratio
myDF2 <- myDF[with(myDF,which(duplicate==FALSE | (duplicate==TRUE & bestRatio==1))),]
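Rows can tie on `bestRatio`: in the sample data groups b and c each appear in three eggphases, so both of Lucy's "after" rows get a ratio of 1 and the filter keeps both. A quick way to check whether a result still contains more than one row per key is sketched below in base R; `remaining_duplicates` is a hypothetical helper name, not part of the answer.

```r
myDF <- read.table(text = "Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE, stringsAsFactors = FALSE)

key_cols <- c("Name", "Eggphase", "Tissue", "Food")

# Hypothetical helper: return every row whose key combination occurs more than once
remaining_duplicates <- function(df, keys = key_cols) {
  dup <- duplicated(df[, keys]) | duplicated(df[, keys], fromLast = TRUE)
  df[dup, , drop = FALSE]
}

# On the raw data this lists the Kia/after, Lucy/after and Betty/yolkdep rows
remaining_duplicates(myDF)

# After a plain first-row deduplication nothing is left to report
deduped <- myDF[!duplicated(myDF[, key_cols]), ]
nrow(remaining_duplicates(deduped))
```

Running the same helper on `myDF2` shows whether a further tie-breaker is needed.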

Answer 2: (score: 0)

Hey, thanks for your help, guys! The combination you suggested seems to work:

# Create a var that indicates a duplicate or a record with a duplicate
myDF$duplicate <- duplicated(myDF[,c('Name','Eggphase','Tissue','Food')])
# this alone won't flag the first entry of a duplicated combination,
# so also check from the last row backwards (fromLast = TRUE)
myDF$duplicate <- ifelse(duplicated(myDF[,c('Name','Eggphase','Tissue','Food')],fromLast=T),yes=T, no=myDF$duplicate)

# Count eggphases by group 
eggphaseCount <- with(myDF,aggregate(x=list(Group_phaseCt=Eggphase),by=list(Group=Group),FUN=function(x) length(unique(x))))
# Merge to DF
myDF <- merge(myDF,eggphaseCount,by='Group',all=T)

# Get the max number of eggphases by name
scale <- with(myDF,aggregate(x=list(PhaseMax=Group_phaseCt),by=list(Name=Name),FUN=max))
# Add to DF
myDF <- merge(myDF,scale,by='Name',all=T)

# Take the ratio
myDF$bestRatio <- with(myDF,Group_phaseCt/PhaseMax)

# make new df without duplicates
myDF2 <- myDF %>% 
  # arrange so that, within each Tissue/Food/Eggphase/Name combination, the first row
  # comes from the group that appears in the most eggphases
  # (Group itself is not a sort key: sorting by Group first would make the desc() keys moot)
  arrange(Tissue, Food, Eggphase, Name, desc(Group_phaseCt), desc(PhaseMax)) %>% 
  # keep one row per combination of the specified variables, retaining all other columns
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE)
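For comparison, here is a compact dplyr-only sketch that combines the ideas from the answers: rank each row's Group first by the number of eggphases it appears in overall, then by how many rows it has in the same Tissue/Food combination, and keep the top-ranked row per key. The two-level ranking and the helper columns `group_phases` and `group_rows` are assumptions made for illustration, not something stated in the question.

```r
library(dplyr)

myDF <- read.table(text = "Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE, stringsAsFactors = FALSE)

one_bird <- myDF %>%
  # number of distinct eggphases each Group appears in, across all rows
  group_by(Group) %>%
  mutate(group_phases = n_distinct(Eggphase)) %>%
  # number of rows each Group has within a Tissue/Food combination
  group_by(Tissue, Food, Group) %>%
  mutate(group_rows = n()) %>%
  ungroup() %>%
  # preferred rows first: more eggphases, then more rows as a tie-breaker
  arrange(desc(group_phases), desc(group_rows)) %>%
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE) %>%
  select(-group_phases, -group_rows)
```

On the sample data this keeps the same six Name/Eggphase/Group combinations as the desired result in the question, although in a different row order. Whether the tie-breaker suits a larger dataset depends on what "most common" should mean there.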