dplyr / data.table: summarise and paste only if a column is not empty / NA?

Time: 2017-09-18 19:40:24

Tags: r dplyr data.table

My data is:

Name        House   Street      Apt City    Postal  Phone
DUMA PAUL   2030    GREEN ROAD      DESERT  Z0K2K1  999-577-3789
DUNN S              GREEN ROAD      DESERT  Z0K2K1  999-577-3256
FERGUSON BOB        GREEN ROAD      DESERT  Z0K2K1  999-577-3771
FITSCHEN A  3989    GREEN ROAD      DESERT  Z0K2K1  999-577-3557
BLACK CARY  2079    GREEN ROAD      DESERT  Z0K2K1  999-577-3779
BLACK RUTH  2079    GREEN ROAD      DESERT  Z0K2K1  999-577-3779

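I am trying to compare the Names (dynamically; the data is sorted by House). If they match AND the house number matches, I want to join the two corresponding phone numbers with " OR ", join the names with ", AND ", and drop the leftover row that was merged.

I am using: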

 data <- data %>%
    group_by(House, Street, Apt, City, Postal) %>%
    summarise(
      Name  = first(paste(Name, collapse = ", AND ")),
      Phone = paste(unique(Phone), collapse = " OR ")
    ) %>%
    ungroup() %>%
    arrange(Street, desc(House)) %>%
    select(colnames(dataset)) %>%
    filter(!Phone %in% dnc$`Home Phone`)

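Problem: with the dplyr code above, rows are also merged when House is NA (or blank; I convert my NAs to blanks) and Apt is NA (or ""), which I do not want. So with the code above I get: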

  Name                        House  Street      Apt  City    Postal  Phone
  DUNN S, AND FERGUSON BOB           GREEN ROAD       DESERT  Z0K2K1  9995773256 OR 9995773772
  DUMAS PAUL                  2030   GREEN ROAD       DESERT  Z0K2K1  9995773789
  BLACK CARY, AND BLACK RUTH  2079   GREEN ROAD       DESERT  Z0K2K1  9995773779
  FITSCHEN A                  3989   GREEN ROAD       DESERT  Z0K2K1  9995773556

Note that DUNN S and FERGUSON BOB are now merged together. I do not want that.
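For illustration, the merge happens because `group_by()` puts all rows with a missing key into one shared NA group. A minimal sketch on a made-up three-row table (the `toy` data frame and its values are assumptions, not the real data):

```r
library(dplyr)

# Hypothetical toy data: two rows without a house number, one with.
toy <- data.frame(
  House = c(NA, NA, "2079"),
  Name  = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

# NA is treated as an ordinary grouping value, so both NA rows share a group.
counts <- toy %>%
  group_by(House) %>%
  summarise(n = n()) %>%
  ungroup()

print(counts)
```

Any summary computed per group therefore pastes the two NA rows together, just like the two unrelated names above.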

dput (sorry if it is not helpful):

  list(structure(list(X__1 = c(NA, NA, NA, NA, NA, NA), Name = c("DUMAS PAUL",
    "DUNN S", "FERGUSON BOB", "FITSCHEN A", "BLACK CARY", "BLACK RUTH"
    ), House = c("2030", NA, NA, "3989", "2079", "2079"), Street = c("GREEN ROAD",
    "GREEN ROAD", "GREEN ROAD", "GREEN ROAD", "GREEN ROAD", "GREEN ROAD"
    ), Apt = c(NA, NA, NA, NA, NA, NA), City = c("DESERT", "DESERT",
    "DESERT", "DESERT", "DESERT", "DESERT"), Prov = c("ZK", "ZK",
    "ZK", "ZK", "ZK", "ZK"), Postal = c("Z0K2K1", "Z0K2K1", "Z0K2K1",
    "Z0K2K1", "Z0K2K1", "Z0K2K1"), Phone = c("999-577-3789", "999-577-3256",
    "999-577-3772", "999-577-3556", "999-577-3779", "999-577-3779"
    ), `Last Appear Date` = c(NA, NA, NA, NA, NA, NA)), .Names = c("X__1",
    "Name", "House", "Street", "Apt", "City", "Prov", "Postal", "Phone",
    "Last Appear Date"), class = c("tbl_df", "tbl", "data.frame"),
    row.names = c(NA, -6L)))

Thanks

2 answers:

Answer 0 (score: 3)

Inside DT[, {...}, by=] you can write almost anything. In this case, if ... else works:

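library(data.table)
library(magrittr)
DT = as.data.table(data)

DT[, 
  # House and Apt are grouping columns, so each is a single value per group
  if (!(is.na(House) & is.na(Apt))) 
    # a real address: collapse the group into one row
    .(
      Name = Name %>% paste(collapse = ", AND "), 
      Phone = Phone %>% unique %>% paste(collapse = " OR ")
    )
  else
    # no house and no apartment: keep every row as it is
    .(Name, Phone)
, by=.(House, Street, Apt, City, Postal)]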

   House     Street Apt   City Postal                       Name        Phone
1:  2030 GREEN ROAD  NA DESERT Z0K2K1                 DUMAS PAUL 999-577-3789
2:    NA GREEN ROAD  NA DESERT Z0K2K1                     DUNN S 999-577-3256
3:    NA GREEN ROAD  NA DESERT Z0K2K1               FERGUSON BOB 999-577-3772
4:  3989 GREEN ROAD  NA DESERT Z0K2K1                 FITSCHEN A 999-577-3556
5:  2079 GREEN ROAD  NA DESERT Z0K2K1 BLACK CARY, AND BLACK RUTH 999-577-3779

Something similar is probably possible with dplyr::do.
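A rough dplyr sketch of the same idea, for comparison. It does not use dplyr::do; instead it sets the rows without an address aside, summarises the rest, and binds the two parts back together. The `df` data frame below is a cut-down copy of the question's data, and the names `no_address`, `merged`, and `result` are only illustrative:

```r
library(dplyr)

# Cut-down copy of the question's data (only the columns used here).
df <- data.frame(
  Name   = c("DUMAS PAUL", "DUNN S", "FERGUSON BOB",
             "FITSCHEN A", "BLACK CARY", "BLACK RUTH"),
  House  = c("2030", NA, NA, "3989", "2079", "2079"),
  Street = "GREEN ROAD",
  Apt    = NA_character_,
  City   = "DESERT",
  Postal = "Z0K2K1",
  Phone  = c("999-577-3789", "999-577-3256", "999-577-3772",
             "999-577-3556", "999-577-3779", "999-577-3779"),
  stringsAsFactors = FALSE
)

# Rows with neither a house nor an apartment number are left untouched.
no_address <- df %>% filter(is.na(House) & is.na(Apt))

# All other rows are collapsed per address.
merged <- df %>%
  filter(!(is.na(House) & is.na(Apt))) %>%
  group_by(House, Street, Apt, City, Postal) %>%
  summarise(
    Name  = paste(Name, collapse = ", AND "),
    Phone = paste(unique(Phone), collapse = " OR ")
  ) %>%
  ungroup()

# bind_rows() matches columns by name, so the differing column order is fine.
result <- bind_rows(merged, no_address)
print(result)
```

The row order of `result` differs from the data.table output, so an `arrange()` step may be needed afterwards.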

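You do not have to use magrittr here; it is just my preference for the paste part. You may also want to add a %>% sort step to those pipes (so that the phone and name lists are always in ascending order).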
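A small sketch of the sort step, written without magrittr. The two-row table and its second phone number are made up purely to show the ordering:

```r
library(data.table)

# Hypothetical toy rows, deliberately in descending order.
toy <- data.table(
  Name  = c("BLACK RUTH", "BLACK CARY"),
  House = "2079",
  Phone = c("999-577-3779", "999-577-3700")
)

# sort() before paste() makes the collapsed strings independent of row order.
out <- toy[, .(
  Name  = paste(sort(Name), collapse = ", AND "),
  Phone = paste(sort(unique(Phone)), collapse = " OR ")
), by = House]

print(out)
```

With the sort in place, two inputs that contain the same rows in a different order give identical Name and Phone strings.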

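Answer 1 (score: 0)

I guess there is no "pretty" solution to this problem; this kind of processing does not fit the dplyr workflow well. One workaround is to somehow mark the houses that have empty data so that they are not grouped together. One way is to put "#row_number" in House when it is empty. The rows then no longer group together, because each empty row gets a different number. Once the processing is done, you only need to replace the values starting with # with an empty string or NA.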
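The workaround could look roughly like the sketch below. It assumes that no real House value starts with "#", and `df` is a cut-down copy of the question's data rather than the full table:

```r
library(dplyr)

# Cut-down copy of the question's data (only the columns used here).
df <- data.frame(
  Name   = c("DUMAS PAUL", "DUNN S", "FERGUSON BOB",
             "FITSCHEN A", "BLACK CARY", "BLACK RUTH"),
  House  = c("2030", NA, NA, "3989", "2079", "2079"),
  Street = "GREEN ROAD",
  Apt    = NA_character_,
  City   = "DESERT",
  Postal = "Z0K2K1",
  Phone  = c("999-577-3789", "999-577-3256", "999-577-3772",
             "999-577-3556", "999-577-3779", "999-577-3779"),
  stringsAsFactors = FALSE
)

result <- df %>%
  # Give every empty House a unique placeholder such as "#2", "#3".
  mutate(House = ifelse(is.na(House), paste0("#", row_number()), House)) %>%
  group_by(House, Street, Apt, City, Postal) %>%
  summarise(
    Name  = paste(Name, collapse = ", AND "),
    Phone = paste(unique(Phone), collapse = " OR ")
  ) %>%
  ungroup() %>%
  # Turn the placeholders back into NA once the grouping is done.
  mutate(House = ifelse(startsWith(House, "#"), NA_character_, House))

print(result)
```

Replace `NA_character_` with `""` in the last step if blanks are preferred over NA.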