Count the number of intersecting values between one data frame column and all columns of other data frames

Time: 2016-09-28 14:19:51

Tags: r database dplyr

I have some tables that I pulled from a database via RODBC. The first one has a primary key field, __ID:

dfA  <- data.frame(
`__ID` = c("a1","a2","a3"), 
col=c(1,2,3), 
check.names = FALSE )

  __ID col
1   a1   1
2   a2   2
3   a3   3

The second one contains foreign key fields whose names start with _ID.

dfB  <- data.frame(
`_ID0` = c("z1", "z2", "z3"), 
`_ID1` = c("a1", "b1", "c1"), 
`_ID2` = c("a1", "a2", "c1"), 
`_ID3` = c("a1", "a2", "a3"), 
check.names = FALSE  )

  _ID0 _ID1 _ID2 _ID3
1   z1   a1   a1   a1
2   z2   b1   a2   a2
3   z3   c1   c1   a3

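I want to generate the following data frame. It contains the names of the two tables above and every pairwise combination of the primary key field in the first table with a foreign key field in the other table. For each pair, the intersects column shows the number of values the two columns have in common.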

matches  <- data.frame(
pk_table = "dfA", 
pk=c("__ID", "__ID","__ID","__ID"), 
fk_table= c("dfB", "dfB","dfB","dfB"), 
fk=c("_ID0", "_ID1", "_ID2", "_ID3"), 
intersects=c(0, 1,2,3), 
check.names = FALSE )

  pk_table   pk fk_table   fk intersects
1      dfA __ID      dfB _ID0          0
2      dfA __ID      dfB _ID1          1
3      dfA __ID      dfB _ID2          2
4      dfA __ID      dfB _ID3          3

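Here is an example of how to compute a single value of the intersects column. It returns 1 because exactly one value in the __ID column also appears in _ID1.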

length( intersect(dfA$`__ID`, dfB$`_ID1`) )
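As a minimal sketch of how the single-pair computation above could be extended to every column of dfB without an explicit loop, one option is base R's vapply, which iterates over the columns of a data frame. The variable name counts is only illustrative, and stringsAsFactors = FALSE is added here so the sketch behaves the same on older R versions.

```r
# Data from the question
dfA <- data.frame(
  `__ID` = c("a1", "a2", "a3"),
  col = c(1, 2, 3),
  check.names = FALSE, stringsAsFactors = FALSE
)
dfB <- data.frame(
  `_ID0` = c("z1", "z2", "z3"),
  `_ID1` = c("a1", "b1", "c1"),
  `_ID2` = c("a1", "a2", "c1"),
  `_ID3` = c("a1", "a2", "a3"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Apply the single-pair computation to each column of dfB.
# A data frame is a list of columns, so vapply visits one column at a time
# and returns an integer vector named after the columns.
counts <- vapply(
  dfB,
  function(column) length(intersect(dfA[["__ID"]], column)),
  integer(1)
)
print(counts)
```

For the example data this should give 0, 1, 2 and 3 for _ID0 through _ID3, matching the intersects column of the desired result.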

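How can I create the above without loops? I would like a solution that accepts the following inputs:

  • the table name and column name of the primary key field
  • all the other data structures (dfB, dfC, etc.)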

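The function should then count all matches between the primary key field and every column of all the other data structures supplied. In total, my database has 700 columns across 15 tables. My primary key field lives in one table, and I want to count how often the values in that column occur in every column of all 15 tables (including the table it is found in). I cannot assume that the foreign key columns follow a particular naming convention, but the database holds less than 50 MB of data in total, so I do not expect performance problems.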

1 Answer:

Answer 0 (score: 1)

This should do the trick:

library(dplyr)
library(tidyr)

options(stringsAsFactors = FALSE)

dfA  <- data.frame(
  `__ID` = c("a1","a2","a3"), 
  col=c(1,2,3),
  check.names = FALSE )

dfB  <- data.frame(
  fk_table = c("dfB", "dfB","dfB"), #added a column with the table name
  `_ID0` = c("z1", "z2", "z3"), 
  `_ID1` = c("a1", "b1", "c1"), 
  `_ID2` = c("a1", "a2", "c1"), 
  `_ID3` = c("a1", "a2", "a3"), 
  check.names = FALSE  )

dfB%>%
  # first we gather the dataframe to long, tidy format
  gather(key = fk, value = value, `_ID0`:`_ID3`)%>%

  # then we do a left join. 
  # this introduces NA's for values (e.g. c1) that are not in dfA
  left_join(dfA, by = c("value" = "__ID"))%>%

  # Now we group by fk name (e.g. _ID0)
  group_by(fk_table, fk)%>%

  # And we count how often the result is not NA
  # an inner_join followed by counting the rows would be simpler
  # but then you don't get zero values as in the example
  summarise(intersects=sum(!is.na(col)))

This returns the following:

  fk_table    fk intersects
1      dfB  _ID0          0
2      dfB  _ID1          1
3      dfB  _ID2          2
4      dfB  _ID3          3

The only difference is that the final result does not have the pk and pk_table columns, but I think adding them would not be difficult.
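One possible way to add the pk and pk_table columns and to handle several tables at once is sketched below in base R. It is not the answer's dplyr pipeline: the function name count_key_matches and its arguments are hypothetical, and it assumes the tables are passed as a named list of data frames. It counts distinct shared values with length(intersect(...)), as in the question, whereas the join above counts matching rows, so the two can differ when a foreign key column contains repeated values.

```r
# Hypothetical helper: count, for every column of every supplied table,
# how many distinct values it shares with the primary key column.
count_key_matches <- function(pk_table, pk, tables) {
  pk_values <- tables[[pk_table]][[pk]]
  per_table <- lapply(names(tables), function(tbl_name) {
    tbl <- tables[[tbl_name]]
    data.frame(
      pk_table = pk_table,
      pk = pk,
      fk_table = tbl_name,
      fk = names(tbl),
      # one count per column of the current table
      intersects = vapply(
        tbl,
        function(column) length(intersect(pk_values, column)),
        integer(1)
      ),
      row.names = NULL,
      stringsAsFactors = FALSE
    )
  })
  # stack the per-table results into one data frame
  do.call(rbind, per_table)
}

# Example data from the question
dfA <- data.frame(
  `__ID` = c("a1", "a2", "a3"),
  col = c(1, 2, 3),
  check.names = FALSE, stringsAsFactors = FALSE
)
dfB <- data.frame(
  `_ID0` = c("z1", "z2", "z3"),
  `_ID1` = c("a1", "b1", "c1"),
  `_ID2` = c("a1", "a2", "c1"),
  `_ID3` = c("a1", "a2", "a3"),
  check.names = FALSE, stringsAsFactors = FALSE
)

matches <- count_key_matches("dfA", "__ID", list(dfA = dfA, dfB = dfB))
print(matches)
```

Because the table holding the primary key is part of the list, the result also includes rows for dfA's own columns, which the question asks for. The implicit iteration in lapply and vapply should be adequate for roughly 700 columns and under 50 MB of data, though that has not been measured here.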