Keeping the less recent duplicate row in R

Time: 2018-12-28 14:57:00

Tags: r dataset

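I have a dataset with bill number, day, month, year, and aggregate value. Many bill numbers are duplicated, and I want to keep only the first (earliest) row for each one. If the duplicates share the same day, month, and year, I want to keep the one with the highest aggregate value.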

For example, if the dataset currently looks like this:

Bill Number   Day   Month    Year   Ag. Value
   1           10     4       1998     10
   1           11     4       1998     14
   2           23     11      2001     12
   2           23     11      2001     9
   3           11     3       2005     8
   3           12     3       2005     9
   3           13     3       2005     4

I would like the result to look like this:

Bill Number  Day  Month  Year  Ag. Value
    1         10    4     1998    10
    2         23    11    2001    12
    3         11    3     2005    8

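I'm not sure whether there is a single command that takes all of these parameters or whether I should do this in stages, but either way I'm not sure how to start. I tried duplicated() and unique() and then got stuck.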

Thanks!

4 answers:

Answer 0: (score: 3)

library( data.table )

dt <- fread("Bill_Number   Day   Month    Year   Ag_Value
1           10     4       1998     10
1           11     4       1998     14
2           23     11      2001     12
2           23     11      2001     9
3           11     3       2005     8
3           12     3       2005     9
3           13     3       2005     4", header = TRUE)

dt[ !duplicated( Bill_Number), ]  

#    Bill_Number Day Month Year Ag_Value
# 1:           1  10     4 1998       10
# 2:           2  23    11 2001       12
# 3:           3  11     3 2005        8

dt[, .SD[1], by = .(Bill_Number) ]  #other approach, a bit slower
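Both calls above keep whichever row happens to come first for each bill number, so they give the requested result only when the data is already sorted by date with the highest value first among ties. Here is a minimal sketch that sorts explicitly before dropping duplicates. The table dt2 and its shuffled row order are made up for illustration (same rows as in the question).

```r
library(data.table)

# Same rows as the question, deliberately shuffled (hypothetical input order)
dt2 <- data.table(
  Bill_Number = c(3, 2, 1, 3, 2, 1, 3),
  Day         = c(13, 23, 11, 11, 23, 10, 12),
  Month       = c(3, 11, 4, 3, 11, 4, 3),
  Year        = c(2005, 2001, 1998, 2005, 2001, 1998, 2005),
  Ag_Value    = c(4, 9, 14, 8, 12, 10, 9)
)

# Sort by bill number, then earliest date, then highest value;
# afterwards the first row of each bill number is the one to keep
res <- dt2[order(Bill_Number, Year, Month, Day, -Ag_Value)][!duplicated(Bill_Number)]
res
```

Sorting first makes the result independent of the order in which the rows arrive.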

Answer 1: (score: 2)

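duplicated() flags entries that are identical to an earlier entry (i.e., one with a smaller subscript). So sorting the rows by bill number and then by date (earliest to latest), and then removing the duplicates, should do the trick. Combining your day, month, and year columns into a single date column may help.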
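A minimal base R sketch of this idea follows. The data frame name bills and the column names are assumptions chosen for illustration, and the extra sort on descending Ag_Value is added to cover the tie-breaking rule from the question.

```r
# Example data from the question (names are illustrative)
bills <- data.frame(
  Bill_Number = c(1, 1, 2, 2, 3, 3, 3),
  Day         = c(10, 11, 23, 23, 11, 12, 13),
  Month       = c(4, 4, 11, 11, 3, 3, 3),
  Year        = c(1998, 1998, 2001, 2001, 2005, 2005, 2005),
  Ag_Value    = c(10, 14, 12, 9, 8, 9, 4)
)

# Combine day, month and year into a single Date column
bills$Date <- as.Date(ISOdate(bills$Year, bills$Month, bills$Day))

# Sort by bill number, then earliest date first, then highest value first
sorted <- bills[order(bills$Bill_Number, bills$Date, -bills$Ag_Value), ]

# duplicated() marks every repeat after the first occurrence,
# so negating it keeps only the first row per bill number
result <- sorted[!duplicated(sorted$Bill_Number), ]
result
```

For the sample data this keeps the rows dated 10/4/1998, 23/11/2001 (value 12) and 11/3/2005.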

Answer 2: (score: 0)

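This answer uses the dplyr package and satisfies your condition: "If there are duplicates with the same day, month, and year, I would like to keep the one with the highest aggregate value."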

library(data.table)
library(dplyr)

myData <- fread("Bill_Number   Day   Month    Year   Ag_Value
        1           10     4       1998     10
        1           11     4       1998     14
        2           23     11      2001     12
        2           23     11      2001     9
        3           11     3       2005     8
        3           12     3       2005     9
        3           13     3       2005     4", header = TRUE)

myData <- as_tibble(myData) #convert to a tibble
sData <- arrange(myData, Bill_Number, Year, Month, Day, desc(Ag_Value)) #sort by bill, then earliest date, then highest value
fData <- distinct(sData, Bill_Number, .keep_all = TRUE) #keep the first row per bill number
fData
# A tibble: 3 x 5
  Bill_Number   Day Month  Year Ag_Value
       <int> <int> <int> <int>    <int>
1           1    10     4  1998       10
2           2    23    11  2001       12
3           3    11     3  2005        8

Answer 3: (score: 0)

I used a few loops and conditional checks, and tried a test dataset in addition to the "base" dataset you mentioned.

library(tidyverse)

#base dataset
billNumber <- c(1,1,2,2,3,3,3)
day <- c(10,11,23,23,11,12,13)
month <- c(4,4,11,11,3,3,3)
year <- c(1998,1998,2001,2001,2005,2005,2005)
agValue <- c(10,14,12,9,8,9,4)

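#test dataset with extra cases; running these lines overrides the base dataset
#(the output printed below is for the base dataset)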
billNumber <- c(1,1,2,2,3,3,3,4,4,4)
day <- c(10,11,23,23,11,12,13,15,15,15)
month <- c(4,4,11,11,3,3,3,6,6,6)
year <- c(1998,1998,2001,2001,2005,2005,2005,2020,2020,2020)
agValue <- c(10,14,9,12,8,9,4,13,15,8)

#build the dataset
df <- data.frame(billNumber,day,month,year,agValue)

#add a couple of working columns
df_full <- df %>%
  mutate(
    concat = paste(billNumber,day,month,year,sep="-"),
    flag = ""
  )

df_full

  billNumber day month year agValue       concat flag
1          1  10     4 1998      10  1-10-4-1998     
2          1  11     4 1998      14  1-11-4-1998     
3          2  23    11 2001      12 2-23-11-2001     
4          2  23    11 2001       9 2-23-11-2001     
5          3  11     3 2005       8  3-11-3-2005     
6          3  12     3 2005       9  3-12-3-2005     
7          3  13     3 2005       4  3-13-3-2005     

#separate records with one/multiple occurrences as defined in the question
row_single <- df_full %>% count(concat) %>% filter(n == 1)
df_full_single <- df_full[df_full$concat %in% row_single$concat,]

row_multi <- df_full %>% count(concat) %>% filter(n > 1)
df_full_multi <- df_full[df_full$concat %in% row_multi$concat,]

#flag the rows with a single occurrence: keep the first row per bill number
df_full_single[1,]$flag = "Y"

for (row in seq_len(nrow(df_full_single))[-1]) {  #skips the loop when there is only one row

  if (df_full_single[row,]$billNumber == df_full_single[row-1,]$billNumber) {
    df_full_single[row,]$flag = "N"    
  } else 
  {
    df_full_single[row,]$flag = "Y"
  }
}

df_full_single


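#flag the rows with multiple occurrences: keep the highest agValue per bill number
best <- 1  #index of the best row seen so far for the current bill number
df_full_multi[1,]$flag = "Y"

for (row in seq_len(nrow(df_full_multi))[-1]) {

  if (df_full_multi[row,]$billNumber != df_full_multi[best,]$billNumber) {
    #first row of a new bill number: it starts as the best of its group
    df_full_multi[row,]$flag = "Y"
    best <- row
  } else if (df_full_multi[row,]$agValue > df_full_multi[best,]$agValue) {
    #higher value for the same bill number: it replaces the previous best
    df_full_multi[best,]$flag = "N"
    df_full_multi[row,]$flag = "Y"
    best <- row
  } else 
  {
    df_full_multi[row,]$flag = "N"
  }
}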

df_full_multi

#rebuild full dataset and retrieve the desired output

df_full_final <- rbind(df_full_single,df_full_multi)

df_full_final <- df_full_final[df_full_final$flag == "Y",c(1,2,3,4,5)]

df_full_final <- df_full_final[order(df_full_final$billNumber),]

df_full_final

  billNumber day month year agValue
1          1  10     4 1998      10
3          2  23    11 2001      12
5          3  11     3 2005       8