How do I assign a unique code to duplicated rows in this "df" in R?

Time: 2018-11-20 11:20:03

Tags: r dataframe duplicates id rowname

I have this data frame df:

df <- data.frame(stringsAsFactors=FALSE,
          id = c(1L, 2L, 3L, 4L, 5L, 6L),
     Country = c("ESP", "ESP", "ESP", "ITA", "ITA", "ITA"),
        Year = c(1965L, 1965L, 1965L, 1965L, 1965L, 1965L),
   Time.step = c("Month", "Month", "Month", "Month", "Month", "Month"),
    GSA.numb = c("GSA 5", "GSA 5", "GSA 5", "GSA 17", "GSA 17", "GSA 17"),
     Species = c("Mullus", "Mullus", "Mullus", "Eledone", "Eledone", "Eledone"),
    Quantity = c(500L, 200L, 200L, 350L, 350L, 125L)
                )

df

   id  Country   Year    Time.step    GSA.numb  Species   Quantity
    1    ESP     1965     Month       GSA 5      Mullus     500   
    2    ESP     1965     Month       GSA 5      Mullus     200  
    3    ESP     1965     Month       GSA 5      Mullus     200 
    4    ITA     1965     Month       GSA 17     Eledone    350
    5    ITA     1965     Month       GSA 17     Eledone    350 
    6    ITA     1965     Month       GSA 17     Eledone    125

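Some rows are duplicated, for example rows 3 and 5. I can create a column holding a logical TRUE/FALSE value that marks whether a row is a duplicate: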

df$dup <- duplicated(df[,2:7]) #No id! 

Result:

id  Country   Year    Time.step    GSA.numb  Species   Quantity dup
 1    ESP     1965     Month       GSA 5      Mullus     500   FALSE
 2    ESP     1965     Month       GSA 5      Mullus     200   FALSE
 3    ESP     1965     Month       GSA 5      Mullus     200   TRUE
 4    ITA     1965     Month       GSA 17     Eledone    350   FALSE
 5    ITA     1965     Month       GSA 17     Eledone    350   TRUE
 6    ITA     1965     Month       GSA 17     Eledone    125   FALSE
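
Note that duplicated() flags only the second and later occurrences of a row, which is why rows 2 and 4 stay FALSE above. The following is a minimal sketch, assuming the same data as in the question; the names key_cols and any_dup are illustrative and not part of the original post. Combining a forward and a backward pass marks every member of a duplicated group:

```r
# Same data as in the question (rebuilt here so the sketch is self-contained)
df <- data.frame(stringsAsFactors = FALSE,
  id        = 1:6,
  Country   = c("ESP", "ESP", "ESP", "ITA", "ITA", "ITA"),
  Year      = rep(1965L, 6),
  Time.step = rep("Month", 6),
  GSA.numb  = c("GSA 5", "GSA 5", "GSA 5", "GSA 17", "GSA 17", "GSA 17"),
  Species   = c("Mullus", "Mullus", "Mullus", "Eledone", "Eledone", "Eledone"),
  Quantity  = c(500L, 200L, 200L, 350L, 350L, 125L)
)

# Compare on every column except id, selected by name rather than position
key_cols <- setdiff(names(df), "id")

# TRUE only for the second and later occurrences
dup <- duplicated(df[key_cols])

# TRUE for every row that belongs to a group with at least two members
any_dup <- dup | duplicated(df[key_cols], fromLast = TRUE)
```

Selecting the comparison columns by name avoids hard-coding 2:7, which may matter when the real data frame has a different number of columns.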

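Now I would like a new column (built dynamically, since my real df is very large, with many rows, columns, and variables) that shows, when dup is TRUE, which row the duplicated row refers to, like this:

expected.df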

id  Country Year  Time.step  GSA.numb  Species   Quantity dup  ref  
 1  ESP     1965  Month      GSA 5      Mullus     500   FALSE NA
 2  ESP     1965  Month      GSA 5      Mullus     200   FALSE NA
 3  ESP     1965  Month      GSA 5      Mullus     200   TRUE  =id2
 4  ITA     1965  Month      GSA 17     Eledone    350   FALSE NA
 5  ITA     1965  Month      GSA 17     Eledone    350   TRUE  =id4
 6  ITA     1965  Month      GSA 17     Eledone    125   FALSE NA

I tried:

with(df, ave(as.character(Species), df[,2:6], FUN = make.unique)) 

But the result is:

[1] "Mullus"    "Mullus.1"  "Mullus.2"  "Eledone"   "Eledone.1" "Eledone.2"

I think I need more code for this. Which functions would be useful? (duplicated, make.unique, row.names, etc.)

4 answers:

Answer 0: (score: 4)

A data.table approach, starting from the initial data:

library(data.table)

setDT(df)[, `:=` (dup = seq_len(.N) > 1, ref = paste0("id", first(id))), 
          by = .(Country, Year, Time.step, GSA.numb, Species, Quantity)][dup == FALSE, ref := NA]

Output:

   id Country Year Time.step GSA.numb Species Quantity   dup  ref
1:  1     ESP 1965     Month     GSA5  Mullus      500 FALSE <NA>
2:  2     ESP 1965     Month     GSA5  Mullus      200 FALSE <NA>
3:  3     ESP 1965     Month     GSA5  Mullus      200  TRUE  id2
4:  4     ITA 1965     Month    GSA17 Eledone      350 FALSE <NA>
5:  5     ITA 1965     Month    GSA17 Eledone      350  TRUE  id4
6:  6     ITA 1965     Month    GSA17 Eledone      125 FALSE <NA>
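
Because the question asks for a dynamic solution, the grouping columns can also be derived from the column names instead of being typed out. This is a hedged sketch rather than the answerer's code: it assumes that id is the only column to exclude from the comparison, and the names dt and key_cols are illustrative.

```r
library(data.table)

# Same data as in the question, built directly as a data.table
dt <- data.table(
  id        = 1:6,
  Country   = c("ESP", "ESP", "ESP", "ITA", "ITA", "ITA"),
  Year      = rep(1965L, 6),
  Time.step = rep("Month", 6),
  GSA.numb  = c("GSA 5", "GSA 5", "GSA 5", "GSA 17", "GSA 17", "GSA 17"),
  Species   = c("Mullus", "Mullus", "Mullus", "Eledone", "Eledone", "Eledone"),
  Quantity  = c(500L, 200L, 200L, 350L, 350L, 125L)
)

# Group on every column except id (computed before dup/ref are added)
key_cols <- setdiff(names(dt), "id")

# Within each group, every row after the first is a duplicate
dt[, dup := seq_len(.N) > 1, by = key_cols]

# Duplicates point to the id of the first row in their group
dt[, ref := ifelse(dup, paste0("=id", id[1L]), NA_character_), by = key_cols]
```

Passing a character vector to by lets the same two lines work unchanged when the real table has more columns.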

A tidyverse approach (with dup already created beforehand):

library(tidyverse)

df %>% 
  group_by_at(vars(2:7)) %>% 
  mutate(ref = ifelse(dup, paste0("id", first(id)), NA_character_))

Output:

     id Country  Year Time.step GSA.numb Species Quantity dup   ref  
  <int> <chr>   <int> <chr>     <chr>    <chr>      <int> <lgl> <chr>
1     1 ESP      1965 Month     GSA5     Mullus       500 FALSE NA   
2     2 ESP      1965 Month     GSA5     Mullus       200 FALSE NA   
3     3 ESP      1965 Month     GSA5     Mullus       200 TRUE  id2  
4     4 ITA      1965 Month     GSA17    Eledone      350 FALSE NA   
5     5 ITA      1965 Month     GSA17    Eledone      350 TRUE  id4  
6     6 ITA      1965 Month     GSA17    Eledone      125 FALSE NA

If you want to create the dup column within the same statement:

df %>% 
  group_by_at(vars(2:7)) %>% 
  mutate(
    dup = row_number() > 1,
    ref = ifelse(dup, paste0("id", first(id)), NA_character_))

Output:

     id Country  Year Time.step GSA.numb Species Quantity dup   ref  
  <int> <chr>   <int> <chr>     <chr>    <chr>      <int> <lgl> <chr>
1     1 ESP      1965 Month     GSA5     Mullus       500 FALSE NA   
2     2 ESP      1965 Month     GSA5     Mullus       200 FALSE NA   
3     3 ESP      1965 Month     GSA5     Mullus       200 TRUE  id2  
4     4 ITA      1965 Month     GSA17    Eledone      350 FALSE NA   
5     5 ITA      1965 Month     GSA17    Eledone      350 TRUE  id4  
6     6 ITA      1965 Month     GSA17    Eledone      125 FALSE NA 
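
As of dplyr 1.0.0, group_by_at() is superseded in favor of across(). A minimal sketch of the same idea in that style follows; it assumes dplyr 1.0.0 or later and that id is the only column excluded from the comparison, and the name res is illustrative.

```r
library(dplyr)

# Same data as in the question
df <- data.frame(stringsAsFactors = FALSE,
  id        = 1:6,
  Country   = c("ESP", "ESP", "ESP", "ITA", "ITA", "ITA"),
  Year      = rep(1965L, 6),
  Time.step = rep("Month", 6),
  GSA.numb  = c("GSA 5", "GSA 5", "GSA 5", "GSA 17", "GSA 17", "GSA 17"),
  Species   = c("Mullus", "Mullus", "Mullus", "Eledone", "Eledone", "Eledone"),
  Quantity  = c(500L, 200L, 200L, 350L, 350L, 125L)
)

res <- df %>%
  group_by(across(-id)) %>%            # group on every column except id
  mutate(
    dup = row_number() > 1,            # rows after the first in a group
    ref = ifelse(dup, paste0("=id", first(id)), NA_character_)
  ) %>%
  ungroup()                            # drop grouping before further use
```

Selecting with -id rather than vars(2:7) keeps the grouping correct if columns are added or reordered.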

Answer 1: (score: 2)

You can use the take function to quickly identify duplicates:

tidyverse

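Answer 2: (score: 0)

This example uses base R and matches each duplicate found against the original row. This is helpful when a single row has several duplicates.

Example data (produced with dput(control = NULL), so characters/factors were converted to numbers):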

df <- data.frame(id = c(1, 1, 1, 2, 2, 2), 
           Country = c(1965, 1965, 1965, 1965, 1965, 1965), 
           Year = c(1, 1, 1, 1, 1, 1), 
           Time.step = c(1, 1, 1, 1, 1, 1), 
           GSA.numb = c(5, 5, 5, 17, 17, 17), 
           Species = c(2, 2, 2, 1, 1, 1), Quantity = c(500, 200, 200, 350, 350, 125))

The comparisons are vectorized, so despite the outer loop it should still run reasonably fast on large data frames.

df$dup <- duplicated(df)
dupes <- df[df$dup,]
df$ref <- NA # initialize 
for(i in seq_len(nrow(dupes))){ # seq_len() skips the loop when there are no duplicates
  z=which(df[,1] == dupes[i,1]&
          df[,2] == dupes[i,2]&
          df[,3] == dupes[i,3]&
          df[,4] == dupes[i,4]&
          df[,5] == dupes[i,5]&
          df[,6] == dupes[i,6]&
          df[,7] == dupes[i,7]) # make sure not to include that $dup column!
  df$ref[z[-1]] <- paste0("=id",min(z))
}
df
#  id Country Year Time.step GSA.numb Species Quantity   dup  ref
#1  1    1965    1         1        5       2      500 FALSE <NA>
#2  1    1965    1         1        5       2      200 FALSE <NA>
#3  1    1965    1         1        5       2      200  TRUE =id2
#4  2    1965    1         1       17       1      350 FALSE <NA>
#5  2    1965    1         1       17       1      350  TRUE =id4
#6  2    1965    1         1       17       1      125 FALSE <NA>

You could probably tighten this up with an apply function so that it runs even faster.
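
The loop can also be avoided entirely. The sketch below is one possible vectorized variant, not the answerer's code: it pastes the comparison columns into a single key per row and uses match() to find the first row carrying the same key. It uses the data from the question and assumes that id is the only column to exclude; key_cols, key, and first_row are illustrative names.

```r
# Same data as in the question
df <- data.frame(stringsAsFactors = FALSE,
  id        = 1:6,
  Country   = c("ESP", "ESP", "ESP", "ITA", "ITA", "ITA"),
  Year      = rep(1965L, 6),
  Time.step = rep("Month", 6),
  GSA.numb  = c("GSA 5", "GSA 5", "GSA 5", "GSA 17", "GSA 17", "GSA 17"),
  Species   = c("Mullus", "Mullus", "Mullus", "Eledone", "Eledone", "Eledone"),
  Quantity  = c(500L, 200L, 200L, 350L, 350L, 125L)
)

# Columns that define a duplicate (everything except id)
key_cols <- setdiff(names(df), "id")

# One string key per row; "\r" is a separator unlikely to occur in the data
key <- do.call(paste, c(unname(as.list(df[key_cols])), sep = "\r"))

# match() returns, for each row, the index of the first row with the same key
first_row <- match(key, key)

df$dup <- duplicated(key)
df$ref <- ifelse(df$dup, paste0("=id", df$id[first_row]), NA_character_)
```

Because the key is built by converting values to text, numeric columns that differ only beyond the printed precision could be treated as equal, so this is best suited to integer and character columns like the ones here.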

Answer 3: (score: 0)

Using tidyverse