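I have a dataset with bill number, day, month, year, and aggregate value. Many bill numbers are duplicated, and I want to keep only the first (earliest) one. If duplicates share the same day, month, and year, I want to keep the one with the highest aggregate value.
For example, if the dataset currently looks like this: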
Bill Number Day Month Year Ag. Value
1 10 4 1998 10
1 11 4 1998 14
2 23 11 2001 12
2 23 11 2001 9
3 11 3 2005 8
3 12 3 2005 9
3 13 3 2005 4
I would like the result to look like this:
Bill Number Day Month Year Ag. Value
1 10 4 1998 10
2 23 11 2001 12
3 11 3 2005 8
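I'm not sure whether there is a single command that handles all of these conditions or whether I should do this in stages, but either way I don't know how to start. I tried duplicated() and unique() and then got stuck.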
Thanks!
Answer 0 (score: 3)
library( data.table )
dt <- fread("Bill_Number Day Month Year Ag_Value
1 10 4 1998 10
1 11 4 1998 14
2 23 11 2001 12
2 23 11 2001 9
3 11 3 2005 8
3 12 3 2005 9
3 13 3 2005 4", header = TRUE)
dt[ !duplicated( Bill_Number), ]
# Bill_Number Day Month Year Ag_Value
# 1: 1 10 4 1998 10
# 2: 2 23 11 2001 12
# 3: 3 11 3 2005 8
or
dt[, .SD[1], by = .(Bill_Number) ] #other approach, a bit slower
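Note that both calls above keep whichever row happens to come first for each bill number, so they rely on the rows already being in the desired order. A minimal sketch, assuming the same column names but with the two rows for bill 2 deliberately swapped (an assumption made here to exercise the tie-break), that sorts first and then keeps the first row per bill:

```r
library(data.table)

# Hypothetical variant of the sample data: for bill 2 the lower value (9)
# is listed before the higher one (12) to exercise the tie-break rule.
dt <- data.table(
  Bill_Number = c(1, 1, 2, 2, 3, 3, 3),
  Day         = c(10, 11, 23, 23, 11, 12, 13),
  Month       = c(4, 4, 11, 11, 3, 3, 3),
  Year        = c(1998, 1998, 2001, 2001, 2005, 2005, 2005),
  Ag_Value    = c(10, 14, 9, 12, 8, 9, 4)
)

# Sort by bill, then by date (earliest first), then by value (highest first)
setorder(dt, Bill_Number, Year, Month, Day, -Ag_Value)

# Keep the first row per bill number after sorting
result <- unique(dt, by = "Bill_Number")
result
```

With this ordering, the row kept for bill 2 should be the one with Ag_Value 12 rather than 9.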
Answer 1 (score: 2)
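duplicated() flags entries that are identical to an earlier entry (i.e., one with a smaller index). So sorting by bill number and date (earliest to latest) and then removing the duplicates should do the trick. Combining your day, month, and year columns into a single date column may help.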
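A rough base R sketch of that idea, assuming the column names Bill_Number, Day, Month, Year, and Ag_Value (the Date column and the names ord, sorted, and result are introduced here for illustration only):

```r
# Sample data from the question, with underscores in the column names
df <- data.frame(
  Bill_Number = c(1, 1, 2, 2, 3, 3, 3),
  Day         = c(10, 11, 23, 23, 11, 12, 13),
  Month       = c(4, 4, 11, 11, 3, 3, 3),
  Year        = c(1998, 1998, 2001, 2001, 2005, 2005, 2005),
  Ag_Value    = c(10, 14, 12, 9, 8, 9, 4)
)

# Combine day, month, and year into a single Date column
df$Date <- as.Date(paste(df$Year, df$Month, df$Day, sep = "-"),
                   format = "%Y-%m-%d")

# Sort by bill number, then date (earliest first), then value (highest first)
ord    <- order(df$Bill_Number, df$Date, -df$Ag_Value)
sorted <- df[ord, ]

# duplicated() marks every repeat of an earlier bill number, so negating it
# keeps only the first row of each bill in the sorted data
result <- sorted[!duplicated(sorted$Bill_Number), ]
result
```

Sorting on the negated value is what handles the same-date case: among rows with an identical date, the highest Ag_Value comes first and is therefore the one that survives.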
Answer 2 (score: 0)
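This answer uses the dplyr package and satisfies your condition: "If duplicates share the same day, month, and year, I want to keep the one with the highest aggregate value."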
library(data.table)
library(dplyr)
myData <- fread("Bill_Number Day Month Year Ag_Value
1 10 4 1998 10
1 11 4 1998 14
2 23 11 2001 12
2 23 11 2001 9
3 11 3 2005 8
3 12 3 2005 9
3 13 3 2005 4", header = TRUE)
myData <- as_tibble(myData) #convert to a tibble (as.tibble() is deprecated)
sData <- arrange(myData, Bill_Number, Year, Month, Day, desc(Ag_Value)) #sort by bill, then date, then highest value first
fData <- distinct(sData, Bill_Number, .keep_all = TRUE) #keep the first row per bill number
fData
# A tibble: 3 x 5
Bill_Number Day Month Year Ag_Value
<int> <int> <int> <int> <int>
1 1 10 4 1998 10
2 2 23 11 2001 12
3 3 11 3 2005 8
Answer 3 (score: 0)
I used a few loops and conditional checks, and tried a test dataset in addition to the "base" one you mentioned.
library(tidyverse)
#base dataset
billNumber <- c(1,1,2,2,3,3,3)
day <- c(10,11,23,23,11,12,13)
month <- c(4,4,11,11,3,3,3)
year <- c(1998,1998,2001,2001,2005,2005,2005)
agValue <- c(10,14,12,9,8,9,4)
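#test dataset (overwrites the base dataset above, so run only one of the two blocks; the printed output below is for the base dataset)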
billNumber <- c(1,1,2,2,3,3,3,4,4,4)
day <- c(10,11,23,23,11,12,13,15,15,15)
month <- c(4,4,11,11,3,3,3,6,6,6)
year <- c(1998,1998,2001,2001,2005,2005,2005,2020,2020,2020)
agValue <- c(10,14,9,12,8,9,4,13,15,8)
#build the dataset
df <- data.frame(billNumber,day,month,year,agValue)
#add a couple of working columns
df_full <- df %>%
mutate(
concat = paste(billNumber, day, month, year, sep = "-"),
flag = ""
)
df_full
billNumber day month year agValue concat flag
1 1 10 4 1998 10 1-10-4-1998
2 1 11 4 1998 14 1-11-4-1998
3 2 23 11 2001 12 2-23-11-2001
4 2 23 11 2001 9 2-23-11-2001
5 3 11 3 2005 8 3-11-3-2005
6 3 12 3 2005 9 3-12-3-2005
7 3 13 3 2005 4 3-13-3-2005
#separate records with one vs. multiple occurrences, as defined in the question
row_single <- df_full %>% count(concat) %>% filter(n == 1)
df_full_single <- df_full[df_full$concat %in% row_single$concat,]
row_multi <- df_full %>% count(concat) %>% filter(n > 1)
df_full_multi <- df_full[df_full$concat %in% row_multi$concat,]
#flag the rows with a single occurrence: keep only the first row per bill
df_full_single[1,]$flag = "Y"
for (row in seq_len(nrow(df_full_single))[-1]) { #empty when there is only one row
if (df_full_single[row,]$billNumber == df_full_single[row-1,]$billNumber) {
df_full_single[row,]$flag = "N"
} else
{
df_full_single[row,]$flag = "Y"
}
}
df_full_single
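#flag the rows with multiple occurrences: keep the highest agValue per bill
df_full_multi$flag <- "N"
best <- 1 #row index of the best row so far for the current bill
df_full_multi[best,]$flag <- "Y"
for (row in seq_len(nrow(df_full_multi))[-1]) {
if (df_full_multi[row,]$billNumber != df_full_multi[best,]$billNumber) {
best <- row #first row of a new bill starts as the best one
df_full_multi[best,]$flag <- "Y"
} else if (df_full_multi[row,]$agValue > df_full_multi[best,]$agValue) {
df_full_multi[best,]$flag <- "N" #previous best is no longer the highest
df_full_multi[row,]$flag <- "Y"
best <- row
}
}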
df_full_multi
#rebuild full dataset and retrieve the desired output
df_full_final <- rbind(df_full_single,df_full_multi)
df_full_final <- df_full_final[df_full_final$flag == "Y",c(1,2,3,4,5)]
df_full_final <- df_full_final[order(df_full_final$billNumber),]
df_full_final
billNumber day month year agValue
1 1 10 4 1998 10
3 2 23 11 2001 12
5 3 11 3 2005 8
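For comparison, the same rule can probably be expressed without loops or working columns. This is only a sketch, assuming dplyr is available and using the test dataset from this answer; result is a name introduced here for illustration:

```r
library(dplyr)

# Test dataset from this answer (includes bill 4 with three same-date rows)
df <- data.frame(
  billNumber = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4),
  day        = c(10, 11, 23, 23, 11, 12, 13, 15, 15, 15),
  month      = c(4, 4, 11, 11, 3, 3, 3, 6, 6, 6),
  year       = c(1998, 1998, 2001, 2001, 2005, 2005, 2005, 2020, 2020, 2020),
  agValue    = c(10, 14, 9, 12, 8, 9, 4, 13, 15, 8)
)

result <- df %>%
  # earliest date first within each bill; highest value first within a date
  arrange(billNumber, year, month, day, desc(agValue)) %>%
  group_by(billNumber) %>%
  slice(1) %>%   # first row of each bill after sorting
  ungroup()

result
```

Because the rows are sorted before the first one is picked, the outcome does not depend on the order in which duplicates appear in the input.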