Compare characters in the next row

Time: 2017-09-14 07:05:27

Tags: r dataframe

My data is:

Name    House   Street  Apt City    Postal  Phone
Bob Joe     954 BLUE DRIVE  NA  A PLACE Z5K4N2  999-495-6544
Smith Jane  555 BLUE DRIVE  NA  A PLACE Z5K4N5  999-435-6172
Smith Jane  555 BLUE DRIVE  NA  A PLACE Z5K4N5  999-450-6763

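I am trying to compare Names row by row (dynamically; the data is sorted by House). If the names are equal AND the house numbers are equal, I want to concatenate the two corresponding phone numbers and remove the row that was not concatenated.

So it would look like this: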

Name    House   Street  Apt City    Postal  Phone
Bob Joe     954 BLUE DRIVE  NA  A PLACE Z5K4N2  999-495-6544
Smith Jane  555 BLUE DRIVE  NA  A PLACE Z5K4N5  999-435-6172 OR 999-450-6763

My attempt:

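# Initialise the flag column, then compare each row with the one below it.
# Stopping at the second-to-last row avoids indexing past the end of the data.
data$NameDupes <- NA_character_
for (x in seq_len(nrow(data) - 1)) {
  if (data$Name[x] == data$Name[x + 1]) {
    data$NameDupes[x] <- as.character(data$Name[x])
  }
}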

Then use aggregate:

aggregate(Phone ~ Name + Street + City + Postal + Apt + House, data = df, paste, collapse = " OR ")

Then use a join on my original df.
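The aggregate route described above can be sketched in base R roughly as follows. This is only a sketch under some assumptions: as far as I know, the formula method of aggregate defaults to na.action = na.omit, so a column that is NA throughout (Apt here) would drop every row. The temporary empty-string placeholder and the names key and merged are illustrative choices, not part of the original attempt.

```r
# Sample data mirroring the table in the question
df <- data.frame(
  Name   = c("Bob Joe", "Smith Jane", "Smith Jane"),
  House  = c(954L, 555L, 555L),
  Street = "BLUE DRIVE",
  Apt    = NA,
  City   = "A PLACE",
  Postal = c("Z5K4N2", "Z5K4N5", "Z5K4N5"),
  Phone  = c("999-495-6544", "999-435-6172", "999-450-6763"),
  stringsAsFactors = FALSE
)

# Assumption: NA values in a grouping column would be dropped by the
# default na.action, so swap them for a placeholder before aggregating.
key <- df
key$Apt <- ifelse(is.na(key$Apt), "", as.character(key$Apt))

# Collapse the phone numbers of rows that share every other column
merged <- aggregate(Phone ~ Name + House + Street + Apt + City + Postal,
                    data = key, FUN = paste, collapse = " OR ")

# Restore the missing values and the original column order
merged$Apt[merged$Apt == ""] <- NA
merged <- merged[, names(df)]
merged
```

Because every non-phone column is used as a grouping key, the result already holds one row per person and address, so a separate join back onto the original data frame should not be needed in this case.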

Open to ideas.

Thanks

2 answers:

Answer 0: (score: 2)

A solution using dplyr.

library(dplyr)

dt2 <- dt %>%
  group_by(House, Street, Apt, City, Postal) %>%
  summarise(Name = first(Name), Phone = paste(Phone, collapse = " OR ")) %>%
  ungroup() %>%
  arrange(desc(House)) %>%
  select(colnames(dt))
dt2
# A tibble: 2 x 7
        Name House     Street   Apt    City Postal                        Phone
       <chr> <int>      <chr> <lgl>   <chr>  <chr>                        <chr>
1    Bob Joe   954 BLUE DRIVE    NA A PLACE Z5K4N2                 999-495-6544
2 Smith Jane   555 BLUE DRIVE    NA A PLACE Z5K4N5 999-435-6172 OR 999-450-6763

Data

dt <- read.table(text = "Name    House   Street  Apt City    Postal  Phone
'Bob Joe'     954 'BLUE DRIVE'  NA  'A PLACE' Z5K4N2  '999-495-6544'
'Smith Jane'  555 'BLUE DRIVE'  NA  'A PLACE' Z5K4N5  '999-435-6172'
'Smith Jane'  555 'BLUE DRIVE'  NA  'A PLACE' Z5K4N5  '999-450-6763'",
header = TRUE, stringsAsFactors = FALSE)
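Since the question asks to merge rows only when both the name and the house number match, a variant of the dplyr pipeline could group by Name as well, rather than taking the first name per address. The following is a minimal sketch, assuming dplyr 1.0.0 or later (for the .groups argument); the name dt3 is just an illustrative choice.

```r
library(dplyr)

# Same sample data as in the question
dt <- data.frame(
  Name   = c("Bob Joe", "Smith Jane", "Smith Jane"),
  House  = c(954L, 555L, 555L),
  Street = "BLUE DRIVE",
  Apt    = NA,
  City   = "A PLACE",
  Postal = c("Z5K4N2", "Z5K4N5", "Z5K4N5"),
  Phone  = c("999-495-6544", "999-435-6172", "999-450-6763"),
  stringsAsFactors = FALSE
)

# Group by Name too, so two different people at one address stay separate
dt3 <- dt %>%
  group_by(Name, House, Street, Apt, City, Postal) %>%
  summarise(Phone = paste(Phone, collapse = " OR "), .groups = "drop")
dt3
```

With this grouping, two people who share an address but have different names would keep separate rows, which seems closer to the condition stated in the question.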

Answer 1: (score: 0)

A different answer from @ycw's, using data.table (because I am personally a fan of that package).

Using the data

dt <- read.table(text = "Name    House   Street  Apt City    Postal  Phone
'Bob Joe'     954 'BLUE DRIVE'  NA  'A PLACE' Z5K4N2  '999-495-6544'
'Smith Jane'  555 'BLUE DRIVE'  NA  'A PLACE' Z5K4N5  '999-435-6172'
'Smith Jane'  555 'BLUE DRIVE'  NA  'A PLACE' Z5K4N5  '999-450-6763'",
header = TRUE, stringsAsFactors = FALSE)

we can do it in a neat one-liner

library(data.table)
dt = as.data.table(dt)
dt[,.(Phone = paste(Phone,collapse = " OR ")),by = .(Name,House,Street,Apt,City,Postal)]

Output

         Name House     Street Apt    City Postal                        Phone
1:    Bob Joe   954 BLUE DRIVE  NA A PLACE Z5K4N2                 999-495-6544
2: Smith Jane   555 BLUE DRIVE  NA A PLACE Z5K4N5 999-435-6172 OR 999-450-6763