Keep only those rows of several matrices that have the same elements in a certain column

Asked: 2015-03-13 21:57:56

Tags: r

Let me give an example. Suppose we have 3 tables (focus on column N):

   Table 1         Table 2        Table 3
-------------   -------------   -------------
  N   Values      N   Values     N   Values
-------------   -------------   -------------
  5     1         5    -1         5     1
  10    2         6    -2         6     21
  15    3         10   -3         10    5
                  15   -4         12    6
                                  15    3

I want to delete the extra rows so that all tables have the same values in column N. Result:

   Table 1         Table 2        Table 3
-------------   -------------   -------------
  N   Values      N   Values     N   Values
-------------   -------------   -------------
  5     1         5    -1         5     1
  10    2         10   -3         10    5  
  15    3         15   -4         15    3

I believe there is a simple way to do this in R, but I am an absolute beginner. I would greatly appreciate your help!

Reproducible data

Table1 <- structure(list(N = c(5L, 10L, 15L), Values = 1:3), .Names = c("N", 
"Values"), row.names = c(NA, 3L), class = "data.frame")

Table2 <- structure(list(N = c(5L, 6L, 10L, 15L), Values = c(-1L, -2L, 
-3L, -4L)), .Names = c("N", "Values"), row.names = c(NA, 4L), class = "data.frame")

Table3 <- structure(list(N = c(5L, 6L, 10L, 12L, 15L), Values = c(1L, 21L, 
5L, 6L, 3L)), .Names = c("N", "Values"), row.names = c(NA, 5L
), class = "data.frame")

4 answers:

Answer 0: (score: 1)

Use set intersection to find the values of N that are common to all tables

> t1 <-data.frame(N=c(5,10,15),Values=c(1,2,3))
> t2 <-data.frame(N=c(5,6,10,15),Values=c(-1,-2,-3,-4))
> t3 <-data.frame(N=c(5,6,10,12,15),Values=c(1,21,5,6,3))
> common<-intersect(intersect(t1$N,t2$N),t3$N)
> common
[1]  5 10 15

Then simply subset each table to the rows that have those common values

> newt1<-t1[t1$N %in% common,]
> newt2<-t2[t2$N %in% common,]
> newt3<-t3[t3$N %in% common,]
> newt3
   N Values
1  5      1
3 10      5
5 15      3

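This approach should scale: you can write a function that takes a list of data frames and a column name, and returns a list of the filtered data frames.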
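The function described above could look roughly like the following. This is a minimal sketch in base R, not code from the original answer; the name `keep_common_rows` and its arguments `dfs` and `col` are hypothetical.

```r
# Hypothetical helper: keep only rows whose `col` value occurs in every data frame.
keep_common_rows <- function(dfs, col) {
  # Intersect the key column across all data frames in the list
  common <- Reduce(intersect, lapply(dfs, function(d) d[[col]]))
  # Subset each data frame to the rows with a common key value
  lapply(dfs, function(d) d[d[[col]] %in% common, , drop = FALSE])
}

t1 <- data.frame(N = c(5, 10, 15), Values = c(1, 2, 3))
t2 <- data.frame(N = c(5, 6, 10, 15), Values = c(-1, -2, -3, -4))
t3 <- data.frame(N = c(5, 6, 10, 12, 15), Values = c(1, 21, 5, 6, 3))

res <- keep_common_rows(list(t1 = t1, t2 = t2, t3 = t3), "N")
res$t2
```

`Reduce(intersect, ...)` folds the pairwise intersection over any number of key vectors, so the nested `intersect()` calls do not have to be written out by hand.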

I used data frames here. The same approach works for matrices.
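For matrices the `$` operator is not available, so the key column has to be selected by name with `[, "N"]`. A minimal sketch, assuming the matrices carry column names; `m1`, `m2`, `m3` and `newm2` are made up for illustration:

```r
# Same data as above, stored as numeric matrices with named columns
m1 <- cbind(N = c(5, 10, 15), Values = c(1, 2, 3))
m2 <- cbind(N = c(5, 6, 10, 15), Values = c(-1, -2, -3, -4))
m3 <- cbind(N = c(5, 6, 10, 12, 15), Values = c(1, 21, 5, 6, 3))

# Values of N present in all three matrices
common <- Reduce(intersect, list(m1[, "N"], m2[, "N"], m3[, "N"]))

# drop = FALSE keeps the result a matrix even if only one row matches
newm2 <- m2[m2[, "N"] %in% common, , drop = FALSE]
newm2
```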

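Answer 1: (score: 1)

I would like to propose a general approach that works for an arbitrary number of data frames and for multiple id columns.

The data frames may have different structures, i.e. different numbers and types of columns. The only requirement is that all data frames share the id columns, with the same names and types. In addition, the code detects the case where there is no common combination of id values among the data frames.

Suppose we have a list of data frames dfl and a vector of column names cn that should be checked for value combinations common to all data frames in the list: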

dfl <- list(Table1 = Table1, Table2 = Table2, Table3 = Table3)
cn <- "N"

library(data.table)
# determine common combinations of id values
# (the count assumes each combination occurs at most once per data frame)
common <- rbindlist(lapply(dfl, function(x) setDT(x)[, .SD, .SDcols = cn]))[
  , .(.cnt = .N), by = cn][.cnt == length(dfl)][, -".cnt"]
# stop if there are no common id values
stopifnot(nrow(common) > 0L)
# join with all data tables in dfl, keeping only rows which have common id values
result <- lapply(dfl, function(x) x[common, on = cn, nomatch = 0L])

result
$Table1
    N Values
1:  5      1
2: 10      2
3: 15      3

$Table2
    N Values
1:  5     -1
2: 10     -3
3: 15     -4

$Table3
    N Values
1:  5      1
2: 10      5
3: 15      3

Data

dfl <- structure(list(Table1 = structure(list(N = c(5L, 10L, 15L), Values = 1:3), .Names = c("N", 
"Values"), row.names = c(NA, 3L), class = "data.frame"), Table2 = structure(list(
    N = c(5L, 6L, 10L, 15L), Values = c(-1L, -2L, -3L, -4L)), .Names = c("N", 
"Values"), row.names = c(NA, 4L), class = "data.frame"), Table3 = structure(list(
    N = c(5L, 6L, 10L, 12L, 15L), Values = c(1L, 21L, 5L, 6L, 
    3L)), .Names = c("N", "Values"), row.names = c(NA, 5L), class = "data.frame")), .Names = c("Table1", 
"Table2", "Table3"))

Example with multiple id columns

# create sample data: 5 dataframes with 100 rows each and 3 id columns
set.seed(123L)
ndf <- 5L
dfl <- lapply(seq_len(ndf), function(i) {
  nr <- 100L
  nseq <- 1:6
  data.frame(A = sample(LETTERS[nseq], nr, replace = TRUE),
             b = sample(letters[nseq], nr, replace = TRUE),
             i = sample(nseq, nr, replace = TRUE),
             val = sample.int(nr, nr))
  })
dfl <- setNames(dfl, paste0("df", seq_along(dfl)))
str(dfl)
List of 5
 $ df1:'data.frame':  100 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 2 5 3 6 6 1 4 6 4 3 ...
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 4 2 3 6 3 6 6 4 3 1 ...
  ..$ i  : int [1:100] 2 6 4 4 3 6 3 2 2 2 ...
  ..$ val: int [1:100] 79 1 77 71 61 46 15 99 42 45 ...
 $ df2:'data.frame':  100 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 1 6 4 3 3 5 1 3 5 ...
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 3 3 2 1 3 2 4 4 6 3 ...
  ..$ i  : int [1:100] 2 5 2 2 2 5 1 5 2 3 ...
  ..$ val: int [1:100] 85 26 3 84 33 61 52 36 18 40 ...
 $ df3:'data.frame':  100 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 3 3 1 1 2 6 3 3 5 5 ...
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 6 4 6 4 5 4 5 6 5 1 ...
  ..$ i  : int [1:100] 2 4 1 6 6 3 5 2 1 3 ...
  ..$ val: int [1:100] 81 73 22 99 84 51 57 88 93 61 ...
 $ df4:'data.frame':  100 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 6 3 5 3 6 1 1 5 4 ...
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 1 3 4 6 5 4 1 1 5 1 ...
  ..$ i  : int [1:100] 2 2 1 3 2 5 4 6 1 6 ...
  ..$ val: int [1:100] 94 98 45 23 67 53 55 41 40 100 ...
 $ df5:'data.frame':  100 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 4 1 2 5 5 1 6 1 4 3 ...
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 5 1 3 6 6 5 1 4 6 4 ...
  ..$ i  : int [1:100] 1 6 2 5 4 1 6 4 6 4 ...
  ..$ val: int [1:100] 45 28 16 85 54 53 56 68 59 94 ...
# define id columns
cn <- c("i", "A", "b")

common <- rbindlist(lapply(dfl, function(x) setDT(x)[, .SD, .SDcols = cn]))[
  , .(.cnt = .N), by = cn][.cnt == length(dfl)][, -".cnt"]
stopifnot(nrow(common) > 0L)
result <- lapply(dfl, function(x) x[common, on = cn, nomatch = 0L])

str(result)
List of 5
 $ df1:Classes ‘data.table’ and 'data.frame': 10 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 6 6 6 6 6 4 2 1 5
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 4 4 4 6 6 3 2 3 4 2
  ..$ i  : int [1:10] 2 2 2 3 3 6 5 6 4 1
  ..$ val: int [1:10] 99 85 4 36 83 70 12 52 53 58
  ..- attr(*, ".internal.selfref")=<externalptr> 
 $ df2:Classes ‘data.table’ and 'data.frame': 11 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 6 4 4 2 1 5 5 4 1 ...
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 4 3 2 2 3 4 4 4 1 1 ...
  ..$ i  : int [1:11] 2 6 5 5 6 4 1 1 5 3 ...
  ..$ val: int [1:11] 11 1 58 14 5 71 52 39 81 88 ...
  ..- attr(*, ".internal.selfref")=<externalptr> 
 $ df3:Classes ‘data.table’ and 'data.frame': 14 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 4 2 1 1 5 5 5 5 5 ...
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 6 2 3 4 4 2 2 4 4 4 ...
  ..$ i  : int [1:14] 3 5 6 4 4 1 1 1 1 1 ...
  ..$ val: int [1:14] 25 60 18 78 59 26 32 39 77 28 ...
  ..- attr(*, ".internal.selfref")=<externalptr> 
 $ df4:Classes ‘data.table’ and 'data.frame': 14 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 6 6 4 2 2 5 5 4 4 ...
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 6 3 3 2 3 3 2 2 1 1 ...
  ..$ i  : int [1:14] 3 6 6 5 6 6 1 1 5 5 ...
  ..$ val: int [1:14] 56 86 34 70 31 12 72 1 5 64 ...
  ..- attr(*, ".internal.selfref")=<externalptr> 
 $ df5:Classes ‘data.table’ and 'data.frame': 6 obs. of  4 variables:
  ..$ A  : Factor w/ 6 levels "A","B","C","D",..: 6 6 6 1 1 2
  ..$ b  : Factor w/ 6 levels "a","b","c","d",..: 4 6 3 4 1 4
  ..$ i  : int [1:6] 2 3 6 4 3 4
  ..$ val: int [1:6] 11 48 1 68 32 46
  ..- attr(*, ".internal.selfref")=<externalptr>

In each data frame, only a few rows remain, namely those matching the combinations of id values identified as common:

unlist(lapply(result, nrow))
df1 df2 df3 df4 df5 
 10  11  14  14   6
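One caveat worth noting: counting rows with `.N` treats a combination as common when its total count equals the number of data frames, which can also happen when it is repeated in one data frame and missing from another. A possible variant, sketched here as an assumption and not part of the original answer, deduplicates the id columns of each data frame before counting. The toy data `a` and `b` is made up for illustration.

```r
library(data.table)

# Made-up toy data: id 1 occurs twice in `a` and never in `b`
a <- data.frame(N = c(1L, 1L, 2L), val = 1:3)
b <- data.frame(N = c(2L, 3L), val = 4:5)
dfl <- list(a = a, b = b)
cn <- "N"

# Deduplicate the id columns per data frame before counting,
# so each data frame contributes at most 1 to the count
common <- rbindlist(
  lapply(dfl, function(x) unique(as.data.table(x)[, .SD, .SDcols = cn]))
)[, .(cnt = .N), by = cn][cnt == length(dfl), .SD, .SDcols = cn]

# Keep only rows whose id combination is present in every data frame
result <- lapply(dfl, function(x) as.data.table(x)[common, on = cn, nomatch = 0L])
result
```

Without the `unique()` call, id 1 would reach a count of 2 and be kept even though it never occurs in `b`.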

Answer 2: (score: 0)

Once you have found the "common denominator" (here Table 1), you can do this:

Table2 <- Table2[Table2$N %in% Table1$N,]
Table3 <- Table3[Table3$N %in% Table1$N,]
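Using Table 1 as the reference only gives the desired result if every N in Table 1 also occurs in the other tables. A quick check of that assumption, sketched with the question's data (the variable `is_common` is made up for illustration):

```r
Table1 <- data.frame(N = c(5L, 10L, 15L), Values = 1:3)
Table2 <- data.frame(N = c(5L, 6L, 10L, 15L), Values = c(-1L, -2L, -3L, -4L))
Table3 <- data.frame(N = c(5L, 6L, 10L, 12L, 15L), Values = c(1L, 21L, 5L, 6L, 3L))

# TRUE only if every N in Table1 also occurs in both other tables
is_common <- all(Table1$N %in% Table2$N) && all(Table1$N %in% Table3$N)
is_common
```

If the check fails, Table 1 itself contains rows that would have to be removed, and an intersection over all tables is the safer choice.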

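Answer 3: (score: 0)

Here is a more functional approach that works for any list of tables. First we extract all the 'N' columns, then take the intersection of all those values. Finally we filter each table to that intersection.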

library('tidyverse')

tables <- list(Table1, Table2, Table3)

common <- tables %>%
  map('N') %>%
  reduce(intersect)

tables %>%
  map(filter, N %in% common)
# [[1]]
#    N Values
# 1  5      1
# 2 10      2
# 3 15      3
# 
# [[2]]
#    N Values
# 1  5     -1
# 2 10     -3
# 3 15     -4
# 
# [[3]]
#    N Values
# 1  5      1
# 2 10      5
# 3 15      3