My data is:
Name          House  Street      Apt  City    Postal  Phone
DUMAS PAUL    2030   GREEN ROAD       DESERT  Z0K2K1  999-577-3789
DUNN S               GREEN ROAD       DESERT  Z0K2K1  999-577-3256
FERGUSON BOB         GREEN ROAD       DESERT  Z0K2K1  999-577-3772
FITSCHEN A    3989   GREEN ROAD       DESERT  Z0K2K1  999-577-3556
BLACK CARY    2079   GREEN ROAD       DESERT  Z0K2K1  999-577-3779
BLACK RUTH    2079   GREEN ROAD       DESERT  Z0K2K1  999-577-3779
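I am trying to compare rows (the data is dynamic and sorted by House): when the address is the same AND the house # is the same, I want to join the corresponding phone numbers with "OR", join the names with "AND", and drop the rows that were merged away.
I am using: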
data <- data %>%
  group_by(House, Street, Apt, City, Postal) %>%
  summarise(
    Name  = first(paste(Name, collapse = ", AND ")),
    Phone = paste(unique(Phone), collapse = " OR ")
  ) %>%
  ungroup() %>%
  arrange(Street, desc(House)) %>%
  select(colnames(dataset)) %>%
  filter(!Phone %in% dnc$`Home Phone`)
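Problem: with the dplyr code above, rows where House is NA (or blank; I turn my NAs into blanks) and Apt is NA (or "") also get joined, which I don't want. So with the code above I get: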
Name                        House  Street      Apt  City    Postal  Phone
DUNN S, AND FERGUSON BOB           GREEN ROAD       DESERT  Z0K2K1  9995773256 OR 9995773772
DUMAS PAUL                  2030   GREEN ROAD       DESERT  Z0K2K1  9995773789
BLACK CARY, AND BLACK RUTH  2079   GREEN ROAD       DESERT  Z0K2K1  9995773779
FITSCHEN A                  3989   GREEN ROAD       DESERT  Z0K2K1  9995773556
Note that in the output above DUNN S and FERGUSON BOB have been merged into one row. I don't want that.
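As far as I can tell, the merge happens because group_by() treats NA as an ordinary key value, so every row with a missing House ends up in the same group. A minimal sketch of that behaviour, using a made-up toy data frame (the name `toy` and its rows are assumptions for illustration, not my real data):

```r
library(dplyr)

# Hypothetical toy rows: two with a missing House, one with a real house number.
toy <- data.frame(
  Name  = c("A", "B", "C"),
  House = c(NA, NA, "2079"),
  stringsAsFactors = FALSE
)

# Count rows per House; both NA rows are counted in one shared NA group.
counts <- toy %>%
  group_by(House) %>%
  summarise(n = n())

print(counts)
```
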
dput (sorry if it doesn't help):
list(structure(list(X__1 = c(NA, NA, NA, NA, NA, NA), Name = c("DUMAS PAUL",
"DUNN S", "FERGUSON BOB", "FITSCHEN A", "BLACK CARY", "BLACK RUTH"
), House = c("2030", NA, NA, "3989", "2079", "2079"), Street = c("GREEN ROAD",
"GREEN ROAD", "GREEN ROAD", "GREEN ROAD", "GREEN ROAD", "GREEN ROAD"
), Apt = c(NA, NA, NA, NA, NA, NA), City = c("DESERT", "DESERT",
"DESERT", "DESERT", "DESERT", "DESERT"), Prov = c("ZK", "ZK",
"ZK", "ZK", "ZK", "ZK"), Postal = c("Z0K2K1", "Z0K2K1", "Z0K2K1",
"Z0K2K1", "Z0K2K1", "Z0K2K1"), Phone = c("999-577-3789", "999-577-3256",
"999-577-3772", "999-577-3556", "999-577-3779", "999-577-3779"
), `Last Appear Date` = c(NA, NA, NA, NA, NA, NA)), .Names = c("X__1",
"Name", "House", "Street", "Apt", "City", "Prov", "Postal", "Phone",
"Last Appear Date"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L)))
Thanks
Answer 0 (score: 3)
Inside DT[, {...}, by=] you can write almost anything. In this case, if ... else works:
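library(data.table)
library(magrittr)

DT = as.data.table(data)

DT[,
  # by-columns are length 1 inside j, so this condition is a single TRUE/FALSE
  if (!(is.na(House) & is.na(Apt)))
    .(
      Name  = Name %>% paste(collapse = ", AND "),
      Phone = Phone %>% unique %>% paste(collapse = " OR ")
    )
  else
    # House and Apt are both missing: keep every row of the group as-is
    .(Name, Phone)
, by=.(House, Street, Apt, City, Postal)]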
   House     Street Apt   City Postal                       Name        Phone
1:  2030 GREEN ROAD  NA DESERT Z0K2K1                 DUMAS PAUL 999-577-3789
2:    NA GREEN ROAD  NA DESERT Z0K2K1                     DUNN S 999-577-3256
3:    NA GREEN ROAD  NA DESERT Z0K2K1               FERGUSON BOB 999-577-3772
4:  3989 GREEN ROAD  NA DESERT Z0K2K1                 FITSCHEN A 999-577-3556
5:  2079 GREEN ROAD  NA DESERT Z0K2K1 BLACK CARY, AND BLACK RUTH 999-577-3779
Something similar is probably possible with dplyr::do.
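You don't have to use magrittr here; that is just my preference for the paste part. You may also want to add a %>% sort step to those pipes, so that the phone and name lists always come out in ascending order.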
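As a sketch of the sort suggestion, here is the same if ... else pattern written without magrittr and with sort() added. The rows are made-up toy data in deliberately unsorted order (the second phone number for house 2079 is hypothetical, added only to make the ordering visible):

```r
library(data.table)

# Hypothetical toy rows; 999-577-3111 is an invented number for illustration.
DT <- data.table(
  Name   = c("BLACK RUTH", "BLACK CARY", "DUNN S", "FERGUSON BOB"),
  House  = c("2079", "2079", NA, NA),
  Street = "GREEN ROAD",
  Apt    = NA_character_,
  City   = "DESERT",
  Postal = "Z0K2K1",
  Phone  = c("999-577-3779", "999-577-3111", "999-577-3256", "999-577-3772")
)

res <- DT[,
  if (!(is.na(House) & is.na(Apt)))
    # Merge the group, sorting names and phones so the order is stable.
    .(
      Name  = paste(sort(Name), collapse = ", AND "),
      Phone = paste(sort(unique(Phone)), collapse = " OR ")
    )
  else
    # House and Apt both missing: leave the rows unmerged.
    .(Name, Phone)
, by = .(House, Street, Apt, City, Postal)]

print(res)
```

The two rows for house 2079 collapse into one with alphabetically ordered names and phones, while the two rows without a house number stay separate.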
Answer 1 (score: 0)
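I guess there is no "pretty" solution to this problem; this kind of processing does not fit the dplyr workflow well. One workaround is to somehow mark the houses with empty data so that they are not grouped together. One way to do that is to put "#row_number" in House when it is empty. Those rows will then not be grouped together, because each empty row gets a different number. Once the processing is done, you simply replace the values starting with # with an empty string or NA.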
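The workaround described above could look roughly like this. It is only a sketch: the data frame is rebuilt by hand from the dput in the question, and the "#" prefix assumes that no real house number starts with that character.

```r
library(dplyr)

# Rows copied from the question's dput (only the columns used for grouping).
data <- data.frame(
  Name   = c("DUMAS PAUL", "DUNN S", "FERGUSON BOB",
             "FITSCHEN A", "BLACK CARY", "BLACK RUTH"),
  House  = c("2030", NA, NA, "3989", "2079", "2079"),
  Street = "GREEN ROAD",
  Apt    = NA_character_,
  City   = "DESERT",
  Postal = "Z0K2K1",
  Phone  = c("999-577-3789", "999-577-3256", "999-577-3772",
             "999-577-3556", "999-577-3779", "999-577-3779"),
  stringsAsFactors = FALSE
)

result <- data %>%
  # Give every row with a missing House its own placeholder key ("#2", "#3", ...).
  mutate(House = if_else(is.na(House), paste0("#", row_number()), House)) %>%
  group_by(House, Street, Apt, City, Postal) %>%
  summarise(
    Name  = paste(Name, collapse = ", AND "),
    Phone = paste(unique(Phone), collapse = " OR ")
  ) %>%
  ungroup() %>%
  # Turn the placeholders back into NA once grouping is done.
  mutate(House = if_else(startsWith(House, "#"), NA_character_, House))

print(result)
```

Because each placeholder is unique, DUNN S and FERGUSON BOB stay on separate rows, while the two BLACK rows at house 2079 are still merged.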