Splitting a data frame depending on duplicates in one column

Time: 2019-07-30 11:50:29

Tags: r dataframe duplicates subset

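I have a large data frame with many rows and columns. One column contains character values; some of them occur only once, others several times. I want to split the whole data frame into two: one containing every row whose value in this column is repeated, and one containing every row whose value occurs only once. For example: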

One = c(1,2,3,4,5,6,7,8,9,10)
Two = c(4,5,3,6,2,7,1,8,1,9)
Three = c("a", "b", "c", "d","d","e","f","e","g","c")
df <- data.frame(One, Two, Three)

> df
    One Two Three
1    1   4     a
2    2   5     b
3    3   3     c
4    4   6     d
5    5   2     d
6    6   7     e
7    7   1     f
8    8   8     e
9    9   1     g
10  10   9     c

I would like to end up with two data frames like these:

> dfSingle
    One Two Three
1    1   4     a
2    2   5     b
7    7   1     f
9    9   1     g

> dfMultiple
    One Two Three
3    3   3     c
4    4   6     d
5    5   2     d
6    6   7     e
8    8   8     e
10  10   9     c

I tried using the duplicated() function

dfSingle = subset(df, !duplicated(df$Three))
dfMultiple = subset(df, duplicated(df$Three))

, but that does not work, because the first occurrence of each of "c", "d" and "e" ends up in dfSingle. I also tried a loop

MulipleValues = unique(df$Three[c(which(duplicated(df$Three)))])
dfSingle = data.frame()
x = 1
dfMultiple = data.frame()
y = 1
for (i in 1:length(df$One)) {
  if(df$Three[i] %in% MulipleValues){
    dfMultiple[x,] = df[i,]
    x = x+1
    } else {
    dfSingle[y,] = df[i,]
    y = y+1
  }
}

This seems to do the right thing, because the data frames now have the correct number of rows, but somehow they have 0 columns.

> dfSingle
data frame with 0 columns and 4 rows
> dfMultiple
data frame with 0 columns and 6 rows

What am I doing wrong? Or is there another way to do this?

Thanks for your help!

4 answers:

Answer 0 (score: 4)

In base R, we can use split together with duplicated, which returns a list of two data frames.

df1 <- split(df, duplicated(df$Three) | duplicated(df$Three, fromLast = TRUE))

df1
#$`FALSE`
#  One Two Three
#1   1   4     a
#2   2   5     b
#7   7   1     f
#9   9   1     g

#$`TRUE`
#   One Two Three
#3    3   3     c
#4    4   6     d
#5    5   2     d
#6    6   7     e
#8    8   8     e
#10  10   9     c

where df1[[1]] can be treated as dfSingle and df1[[2]] as dfMultiple.
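The logical mask from this answer can also be used for plain subsetting, which gives the two named data frames directly. The sketch below rebuilds the example data so it runs on its own; the names first_pass and is_repeated are only illustrative. It also shows why duplicated() on its own is not enough: it flags the second and later occurrences only, so the first "c", "d" and "e" are missed unless the column is scanned from the end as well.

```r
# Example data from the question
df <- data.frame(
  One   = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  Two   = c(4, 5, 3, 6, 2, 7, 1, 8, 1, 9),
  Three = c("a", "b", "c", "d", "d", "e", "f", "e", "g", "c")
)

# duplicated() alone marks only the 2nd, 3rd, ... occurrence of a value
first_pass <- duplicated(df$Three)

# Combining it with a scan from the end also marks the first occurrences
is_repeated <- duplicated(df$Three) | duplicated(df$Three, fromLast = TRUE)

dfSingle   <- df[!is_repeated, ]
dfMultiple <- df[is_repeated, ]
```

Row names are kept from the original data frame, so dfSingle keeps the row names 1, 2, 7 and 9.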

Answer 1 (score: 1)

Here is a dplyr approach,

library(dplyr)

df %>% 
 group_by(Three) %>% 
 mutate(new = n() > 1) %>% 
 split(.$new)

which gives,

$`FALSE`
# A tibble: 4 x 4
# Groups:   Three [4]
    One   Two Three new  
  <dbl> <dbl> <fct> <lgl>
1     1     4 a     FALSE
2     2     5 b     FALSE
3     7     1 f     FALSE
4     9     1 g     FALSE

$`TRUE`
# A tibble: 6 x 4
# Groups:   Three [3]
    One   Two Three new  
  <dbl> <dbl> <fct> <lgl>
1     3     3 c     TRUE 
2     4     6 d     TRUE 
3     5     2 d     TRUE 
4     6     7 e     TRUE 
5     8     8 e     TRUE 
6    10     9 c     TRUE 
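If the helper column and the grouping are not wanted in the result, a possible variant is to filter on the group size instead of splitting. This is a sketch under the assumption that dplyr is installed; it is not part of the answer above, and the example data is rebuilt so the block runs on its own.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  One   = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  Two   = c(4, 5, 3, 6, 2, 7, 1, 8, 1, 9),
  Three = c("a", "b", "c", "d", "d", "e", "f", "e", "g", "c")
)

# Keep rows whose value in Three occurs exactly once
dfSingle <- df %>%
  group_by(Three) %>%
  filter(n() == 1) %>%
  ungroup()

# Keep rows whose value in Three occurs more than once
dfMultiple <- df %>%
  group_by(Three) %>%
  filter(n() > 1) %>%
  ungroup()
```

Both results are tibbles with the original three columns, and the rows stay in their original order.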

Answer 2 (score: 0)

You can use base R

One = c(1,2,3,4,5,6,7,8,9,10)
Two = c(4,5,3,6,2,7,1,8,1,9)
Three = c("a", "b", "c", "d","d","e","f","e","g","c")
df <- data.frame(One, Two, Three)

str(df)

df$Three <- as.character(df$Three)
df$count <- as.numeric(ave(df$Three,df$Three,FUN = length))

dfSingle = subset(df,df$count == 1)
dfMultiple = subset(df,df$count > 1)
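The same counting idea can be sketched with table(), which avoids adding a count column to the data frame. This is an alternative to the ave() call above rather than the answer's own method, and the names counts, repeated_values and is_repeated are illustrative.

```r
# Example data from the question
df <- data.frame(
  One   = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  Two   = c(4, 5, 3, 6, 2, 7, 1, 8, 1, 9),
  Three = c("a", "b", "c", "d", "d", "e", "f", "e", "g", "c")
)

# Count how often each value of Three occurs
counts <- table(df$Three)

# Values that occur more than once
repeated_values <- names(counts)[counts > 1]

# Rows whose value is among the repeated ones
is_repeated <- df$Three %in% repeated_values

dfSingle   <- df[!is_repeated, ]
dfMultiple <- df[is_repeated, ]
```

Because %in% compares by value, this works whether Three is stored as a character vector or as a factor.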

Answer 3 (score: 0)

A way to do it with dplyr:

library(dplyr)

df %>%
  group_split(Duplicated = (add_count(., Three) %>% pull(n)) > 1)

Output:

[[1]]
# A tibble: 4 x 4
    One   Two Three Duplicated
  <dbl> <dbl> <fct> <lgl>     
1     1     4 a     FALSE     
2     2     5 b     FALSE     
3     7     1 f     FALSE     
4     9     1 g     FALSE     

[[2]]
# A tibble: 6 x 4
    One   Two Three Duplicated
  <dbl> <dbl> <fct> <lgl>     
1     3     3 c     TRUE      
2     4     6 d     TRUE      
3     5     2 d     TRUE      
4     6     7 e     TRUE      
5     8     8 e     TRUE      
6    10     9 c     TRUE