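R: losing data when using distinct()

Date: 2019-03-19 20:23:35

Tags: r distinct

I'm using distinct() to remove duplicates from a combined dataset, but I'm losing data because distinct() only keeps the first entry.

Example data frame "a":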

 SiteID PYear   Habitat num.1
000901W 2011    W   NA
001101W 2007    W   NA
001801W 2005    W   NA
002001W 2017    W   NA
002401F 2006    F   NA
002401F 2016    F   NA
004001F 2006    F   NA
004001W 2006    W   NA
004101W 2007    W   NA
004101W 2007    W   16
004701F 2017    F   NA
006201F 2008    F   NA
006501F 2009    F   NA
006601W 2007    W   2
006601W 2007    W   NA
006803F 2009    F   NA
007310F 2018    F   NA
007602W 2017    W   NA
008103W 2011    W   NA
008203F 2007    F   1

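I'd like to know how to remove duplicates based on SiteID and num.1 without discarding the duplicates that have a numeric value in the num.1 column. For example, in the data frame, 004101W and 006601W each have multiple entries, and I want to keep the row with the integer rather than the one with NA.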
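To make the problem concrete, here is a minimal sketch on a four-row subset of the data. The object names `small` and `first_kept` are made up for illustration, and the sketch assumes dplyr is installed. It shows that distinct() keeps whichever row of each SiteID comes first, whether or not its num.1 is NA.

```r
library(dplyr)

# Hypothetical four-row subset of `a`: the two SiteIDs that mix NA and non-NA rows
small <- data.frame(
  SiteID = c("004101W", "004101W", "006601W", "006601W"),
  PYear  = c(2007L, 2007L, 2007L, 2007L),
  num.1  = c(NA, 16L, 2L, NA),
  stringsAsFactors = FALSE
)

# distinct() keeps the first row it encounters for each SiteID
first_kept <- small %>%
  distinct(SiteID, .keep_all = TRUE)

# For 004101W the NA row comes first, so the 16 is lost;
# for 006601W the 2 happens to come first, so it survives
first_kept$num.1
```

Which value survives therefore depends only on row order, which is why reordering the rows before calling distinct() is enough to fix it.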

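1 answer:

Answer 0 (score: 0)

(Thanks for updating with more representative example data!)

a now has 20 rows, with 17 distinct SiteID values.

Three of those SiteIDs have more than one row: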

library(tidyverse)
a %>% 
  add_count(SiteID) %>%
  filter(n > 1)

## A tibble: 6 x 5
#  SiteID  PYear Habitat num.1     n
#  <chr>   <int> <chr>   <int> <int>
#1 002401F  2006 F          NA     2    # Both have NA for num.1
#2 002401F  2016 F          NA     2    #  ""

#3 004101W  2007 W          NA     2    # Drop 
#4 004101W  2007 W          16     2    # Keep this one

#5 006601W  2007 W           2     2    # Keep this one
#6 006601W  2007 W          NA     2    # Drop

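If we want to prioritize the rows without NA in num.1, we can arrange by num.1 within each SiteID. arrange() puts NA last, so within each SiteID the non-NA rows come first, and distinct(), which keeps the first row per SiteID, will then keep a row with a non-NA num.1 whenever one exists. Note that arrange() sorts ascending, so if a SiteID had several non-NA values, the smallest would be kept.

(An alternative is also provided in case you want to preserve the original row order of a within each SiteID while still moving the NA values in num.1 to the end. In the is.na(num.1) term, NA rows evaluate to TRUE and sort after rows with a value, which evaluate to FALSE.)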

a %>% 
  arrange(SiteID, num.1) %>%
  #arrange(SiteID, is.na(num.1)) %>%    # Alternative to preserve orig order
  distinct(SiteID, .keep_all = TRUE)

    SiteID PYear Habitat num.1
1  000901W  2011       W    NA
2  001101W  2007       W    NA
3  001801W  2005       W    NA
4  002001W  2017       W    NA
5  002401F  2006       F    NA     # Kept first appearing row, since both NA num.1
6  004001F  2006       F    NA
7  004001W  2006       W    NA
8  004101W  2007       W    16     # Kept non-NA row
9  004701F  2017       F    NA
10 006201F  2008       F    NA
11 006501F  2009       F    NA
12 006601W  2007       W     2     # Kept non-NA row
13 006803F  2009       F    NA
14 007310F  2018       F    NA
15 007602W  2017       W    NA
16 008103W  2011       W    NA
17 008203F  2007       F     1
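The same idea can be sketched in base R, without dplyr, as a cross-check. This is not part of the original answer. It swaps arrange() and distinct() for order() and duplicated(), and the object names `small`, `sorted` and `deduped` are hypothetical. It runs on a six-row subset covering the three SiteIDs that have more than one row.

```r
# Hypothetical six-row subset of `a`: the three SiteIDs that appear twice
small <- data.frame(
  SiteID = c("002401F", "002401F", "004101W", "004101W", "006601W", "006601W"),
  PYear  = c(2006L, 2016L, 2007L, 2007L, 2007L, 2007L),
  num.1  = c(NA, NA, NA, 16L, 2L, NA),
  stringsAsFactors = FALSE
)

# Sort by SiteID, then by is.na(num.1): FALSE sorts before TRUE,
# so rows with a value come before NA rows within each SiteID.
# Ties (e.g. both rows NA) stay in their original order.
sorted <- small[order(small$SiteID, is.na(small$num.1)), ]

# Keep only the first row of each SiteID
deduped <- sorted[!duplicated(sorted$SiteID), ]

deduped
```

On this subset the rows kept are the 2006 row for 002401F (both of its rows are NA, so the first one stays), the 16 for 004101W and the 2 for 006601W, matching the rows flagged in the output above.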

Importing the example data

a <- read.table(header = TRUE, stringsAsFactors = FALSE,
  text = " SiteID PYear   Habitat num.1
000901W 2011    W   NA
001101W 2007    W   NA
001801W 2005    W   NA
002001W 2017    W   NA
002401F 2006    F   NA
002401F 2016    F   NA
004001F 2006    F   NA
004001W 2006    W   NA
004101W 2007    W   NA
004101W 2007    W   16
004701F 2017    F   NA
006201F 2008    F   NA
006501F 2009    F   NA
006601W 2007    W   2
006601W 2007    W   NA
006803F 2009    F   NA
007310F 2018    F   NA
007602W 2017    W   NA
008103W 2011    W   NA
008203F 2007    F   1")