Comparing data frame columns

Time: 2018-02-20 09:20:12

Tags: r dataframe

I have a data frame with duplicated IDs that looks like this:

+-----+------+-------------------+
| ID  | Name | other columns ... |
+-----+------+-------------------+
|  1  | AAA  |                   |
|  1  | BBB  |                   |
|  2  | ABA  |                   |
|  2  | ACA  |                   |
|  2  | CCC  |                   |
|  3  | DDD  |                   |
|  4  | EEE  |                   |
|  4  | EEE  |                   |
|  4  | FFF  |                   |
|  .  |      |                   |
+-----+------+-------------------+

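I want to find the duplicated IDs that have different values in the Name column. I can already find the duplicated IDs, but I want to compare the "Name" column across the rows of the same data frame that share an ID.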

3 answers:

Answer 0 (score: 1)

Here is a solution using dplyr.

library(dplyr)

df %>%
  group_by(ID) %>%
  filter(n() > 1) %>%                         # keep only IDs that occur more than once
  mutate(Unique_Name = n_distinct(Name)) %>%  # number of distinct Name values per ID
  filter(Unique_Name != 1)                    # keep IDs with more than one distinct Name

# or just
df %>%
  group_by(ID) %>%
  filter(n() > 1) %>%            # keep only IDs that occur more than once
  filter(n_distinct(Name) != 1)  # keep IDs with more than one distinct Name

# Data (define df before running the code above)
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L), Name = structure(c(1L, 
4L, 2L, 3L, 5L, 6L, 7L, 7L), .Label = c("AAA", "ABA", "ACA", 
"BBB", "CCC", "DDD", "EEE"), class = "factor")), .Names = c("ID", 
"Name"), class = "data.frame", row.names = c(NA, -8L))
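For readers who would rather not depend on dplyr, the same grouping idea can be sketched in base R. This is only an illustration, not part of the answer above: the data frame below mirrors the 9-row table from the question (including the 4/FFF row, which the `structure()` data above leaves out), and the names `n_names`, `ids_with_different_names` and `result` are made up for the example.

```r
# Sample data mirroring the question's table (assumption: 9 rows, incl. 4/FFF).
df <- data.frame(
  ID   = c(1, 1, 2, 2, 2, 3, 4, 4, 4),
  Name = c("AAA", "BBB", "ABA", "ACA", "CCC", "DDD", "EEE", "EEE", "FFF"),
  stringsAsFactors = FALSE
)

# Number of distinct Name values per ID.
n_names <- tapply(df$Name, df$ID, function(x) length(unique(x)))

# IDs that carry more than one distinct Name.
# An ID that occurs only once can never qualify, so no separate n() > 1 step is needed.
ids_with_different_names <- names(n_names)[n_names > 1]

# Rows belonging to those IDs.
result <- df[as.character(df$ID) %in% ids_with_different_names, ]
```

With the 9-row data, ID 4 is kept because it has both EEE and FFF; with the 8-row data from the answer it would be dropped, since both of its rows are EEE.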

Answer 1 (score: 0)

We can try the following, which returns the IDs that have exactly one distinct Name:

names(which(rowSums(table(df1[1:2]) != 0) == 1))

It is not clear whether the goal is to find the IDs whose 'Name' values are all unique. If that is the case:

library(dplyr)
df1 %>%
  group_by(ID) %>% 
  filter(n_distinct(Name) == n()) %>%
  pull(ID) %>%
  unique
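To see what the `table()` one-liner is doing, it may help to split it into steps. The sketch below is an illustration only: `df1` is assumed to hold the 9-row table from the question, and the intermediate names (`tab`, `distinct_per_id`, `single_name_ids`, `multi_name_ids`) are invented here. It also shows the `> 1` variant, which is probably closer to what the question asks for.

```r
# Assumed sample data, copied from the question's table.
df1 <- data.frame(
  ID   = c(1, 1, 2, 2, 2, 3, 4, 4, 4),
  Name = c("AAA", "BBB", "ABA", "ACA", "CCC", "DDD", "EEE", "EEE", "FFF")
)

# Cross-tabulate ID against Name; a non-zero cell means that Name occurs for that ID.
tab <- table(df1[1:2])

# Count the distinct Names per ID (non-zero cells in each row).
distinct_per_id <- rowSums(tab != 0)

# IDs with exactly one distinct Name: this is what the one-liner returns.
single_name_ids <- names(which(distinct_per_id == 1))

# IDs with several different Names.
multi_name_ids <- names(which(distinct_per_id > 1))
```

Note that both results are character vectors, because they come from the row names of the table.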

Answer 2 (score: 0)

This gives you a new column in which TRUE marks a row whose Name has already appeared earlier under the same ID:

library(dplyr)
df <- tibble(ID = c(1, 1, 2, 2, 2, 3, 4, 4, 4), Name = c("AAA", "BBB", "ABA", "ACA", "CCC", "DDD", "EEE", "EEE", "FFF"))
df0 <- df %>% group_by(ID) %>% mutate(x = duplicated(Name))

With your current df, only row 8 is TRUE (ID == 4 & Name == EEE):

     ID Name  x    
  <dbl> <chr> <lgl>
1  1.00 AAA   F    
2  1.00 BBB   F    
3  2.00 ABA   F    
4  2.00 ACA   F    
5  2.00 CCC   F    
6  3.00 DDD   F    
7  4.00 EEE   F    
8  4.00 EEE   T    
9  4.00 FFF   F    

If you change df so that another Name ('ABA') repeats within the same ID:

df=tibble(ID=c(1,1,2,2,2,3,4,4,4),Name=c("AAA","BBB","ABA","ABA","CCC","DDD","EEE","EEE","FFF"))

you get more TRUE values:

     ID Name  x    
  <dbl> <chr> <lgl>
1  1.00 AAA   F    
2  1.00 BBB   F    
3  2.00 ABA   F    
4  2.00 ABA   T    
5  2.00 CCC   F    
6  3.00 DDD   F    
7  4.00 EEE   F    
8  4.00 EEE   T    
9  4.00 FFF   F    

However, if the same name appears under different IDs:

df=tibble(ID=c(1,1,2,2,2,3,4,4,4),Name=c("AAA","BBB","ABA","ACA","CCC","ABA","EEE","EEE","FFF"))

there are no new matches:

     ID Name  x    
  <dbl> <chr> <lgl>
1  1.00 AAA   F    
2  1.00 BBB   F    
3  2.00 ABA   F    
4  2.00 ACA   F    
5  2.00 CCC   F    
6  3.00 ABA   F    
7  4.00 EEE   F    
8  4.00 EEE   T    
9  4.00 FFF   F
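
One thing to keep in mind is that `duplicated()` only flags the second and later occurrences, which is why row 7 stays FALSE above. If every row of a repeated (ID, Name) pair should be flagged, a possible base R sketch is to combine a forward and a backward pass. The column names `x_later` and `x_all` are made up for this illustration.

```r
# Same data as the last example: 'ABA' appears under both ID 2 and ID 3.
df <- data.frame(
  ID   = c(1, 1, 2, 2, 2, 3, 4, 4, 4),
  Name = c("AAA", "BBB", "ABA", "ACA", "CCC", "ABA", "EEE", "EEE", "FFF"),
  stringsAsFactors = FALSE
)

pairs <- df[c("ID", "Name")]

# Forward pass only: flags later occurrences of an (ID, Name) pair.
df$x_later <- duplicated(pairs)

# Forward and backward pass: flags every row whose (ID, Name) pair occurs more than once.
df$x_all <- duplicated(pairs) | duplicated(pairs, fromLast = TRUE)
```

Because the check is done on the (ID, Name) pair, the 'ABA' rows under ID 2 and ID 3 are still not treated as matches.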