How to extract duplicated rows from a matrix

Asked: 2019-02-20 11:05:04

Tags: r

I have this matrix:

> Y
     [,1] [,2] [,3] [,4]
[1,] "0"  "2"  "9"  "5" 
[2,] "4"  "7"  "7"  "3" 
[3,] "1"  "5"  "7"  "9" 
[4,] "7"  "8"  "7"  "4" 
[5,] "7"  "8"  "7"  "4" 
[6,] "1"  "1"  "7"  "2" 
[7,] "7"  "8"  "7"  "4" 
...

I want to get all the duplicated rows from this matrix, grouped by how often they repeat: once, twice, three times, and so on.

For example,

"7" "8" "7" "4"

occurs 3 times in Y. How do I find all the other cases?

So the output should be:

Return all rows that occur twice in Y.

Return all rows that occur 3 times in Y.

Return all rows that occur 4 or more times in Y.

I tried to solve this with the duplicated() function, but that alone is not enough.
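To illustrate why duplicated() falls short here, the following is a minimal sketch using only the seven rows printed in the question (the rest of Y is elided, so this small matrix is an assumption). duplicated() flags only the second and later copies of a row, and even when combined with fromLast = TRUE it says which rows repeat, not how many times.

```r
# The seven rows shown in the question, rebuilt as a character matrix
Y <- matrix(c("0", "2", "9", "5",
              "4", "7", "7", "3",
              "1", "5", "7", "9",
              "7", "8", "7", "4",
              "7", "8", "7", "4",
              "1", "1", "7", "2",
              "7", "8", "7", "4"), ncol = 4, byrow = TRUE)

# duplicated() marks only the 2nd and later copies of each row
later_copies <- duplicated(Y)

# Scanning from the end as well also marks the first copy
any_copy <- duplicated(Y) | duplicated(Y, fromLast = TRUE)

# Every row that occurs more than once, but without any count attached
Y[any_copy, , drop = FALSE]
```

Separating rows by the number of occurrences still needs a per-row count, which is what the answers below compute.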

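2 Answers:

Answer 0 (score: 2)

Here is a simple solution based on collapsing each row of the matrix into a single string and then tabulating how often each string occurs.

First, we generate some simple dummy data. I generate random zeros and ones to make sure there are plenty of duplicates: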

Y <- matrix(rbinom(100, 1, .5), ncol = 4)
head(Y)
#>      [,1] [,2] [,3] [,4]
#> [1,]    0    0    0    1
#> [2,]    0    0    0    0
#> [3,]    0    0    0    0
#> [4,]    0    0    0    1
#> [5,]    0    1    1    0
#> [6,]    0    0    1    0

# I collapse all the values in each row into a string, so c(0,1,0,1) becomes "0101"
row.ids <- apply(Y, 1, paste, collapse = "")
# Now using table() I can get the frequency with which each pattern appears
row.freqs <- table(row.ids)

# All triply replicated rows
Y[row.ids %in% names(row.freqs[row.freqs==3]),]
#>      [,1] [,2] [,3] [,4]
#> [1,]    0    0    0    1
#> [2,]    0    0    0    1
#> [3,]    0    1    1    0
#> [4,]    0    0    0    1
#> [5,]    0    1    1    0
#> [6,]    0    1    1    0

# All quadruply replicated rows
Y[row.ids %in% names(row.freqs[row.freqs==4]),]
#>       [,1] [,2] [,3] [,4]
#>  [1,]    0    0    0    0
#>  [2,]    0    0    0    0
#>  [3,]    0    0    1    0
#>  [4,]    0    0    1    0
#>  [5,]    0    0    0    0
#>  [6,]    0    0    1    0
#>  [7,]    0    1    1    1
#>  [8,]    0    1    1    1
#>  [9,]    0    1    1    1
#> [10,]    0    0    0    0
#> [11,]    0    1    1    1
#> [12,]    0    0    1    0

Created on 2019-02-20 by the reprex package (v0.2.1)
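The same idea can be wrapped in a small helper that also covers the "twice" and "four or more" cases from the question. This is only a sketch: row_counts is a hypothetical name, not part of the answer, and the example matrix is made up. It also joins values with a separator, because collapse = "" maps both c(1, 11) and c(11, 1) to "111" once values have more than one digit; the sketch assumes "|" never occurs in the data.

```r
# Hypothetical helper: for each row of m, how many times that row occurs in m.
# Assumes the separator `sep` does not appear inside any value.
row_counts <- function(m, sep = "|") {
  ids <- apply(m, 1, paste, collapse = sep)   # one key string per row
  ave(seq_along(ids), ids, FUN = length)      # group size, in row order
}

# Small made-up example matrix
Y <- matrix(c(7, 8, 7, 4,
              1, 1, 7, 2,
              7, 8, 7, 4,
              0, 2, 9, 5,
              1, 1, 7, 2,
              7, 8, 7, 4), ncol = 4, byrow = TRUE)

n <- row_counts(Y)

twice     <- Y[n == 2, , drop = FALSE]  # rows occurring exactly twice
thrice    <- Y[n == 3, , drop = FALSE]  # rows occurring exactly 3 times
four_plus <- Y[n >= 4, , drop = FALSE]  # rows occurring 4 or more times

unique(thrice)  # one copy of each triply repeated row
```

Using drop = FALSE keeps the result a matrix even when a selection returns a single row or none.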

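Answer 1 (score: 1)

Using the test matrix Y from the Note at the end, use aggregate to create a data frame ag whose rows are the unique rows of Y, followed by a count of how many times each one occurs.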

ag <- aggregate(cbind(count = apply(Y, 1, toString)) ~ ., as.data.frame(Y), 
  FUN = length)

nc <- ncol(Y)
subset(ag, count == 2, select = -count) # shows rows which occur twice

split(ag[1:nc], ag$count) # splits unique rows into those that occur once, twice, etc.
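The question also asks for rows that occur four or more times. The following is a hedged variant of the same approach rather than the answer's own code: it builds the count column explicitly by summing a column of ones per unique row, and the small matrix is a made-up example.

```r
# Made-up example: (1, 2) occurs 4 times, (3, 4) occurs twice
Y <- matrix(c(1, 2,
              3, 4,
              1, 2,
              1, 2,
              3, 4,
              1, 2), ncol = 2, byrow = TRUE)

# One row per unique row of Y, plus a count column (sum of ones per group)
ag <- aggregate(data.frame(count = rep(1L, nrow(Y))),
                by = as.data.frame(Y), FUN = sum)

nc <- ncol(Y)
four_plus <- subset(ag, count >= 4, select = -count)  # rows occurring 4+ times
by_count  <- split(ag[1:nc], ag$count)                # unique rows grouped by count
```

Here by_count is a list keyed by the occurrence count, so by_count[["2"]] holds the unique rows that appear exactly twice.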

Note

Y <- matrix(c(0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 
0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 
0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 
0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1), 25, 4)