I have some tables that I extracted from a database via RODBC. The first has a primary key field, __ID.
dfA <- data.frame(
`__ID` = c("a1","a2","a3"),
col=c(1,2,3),
check.names = FALSE )
__ID col
1 a1 1
2 a2 2
3 a3 3
The second contains foreign key fields whose names start with _ID.
dfB <- data.frame(
"_ID0" = c("z1", "z2", "z3"),
"_ID1" = c("a1", "b1", "c1"),
`_ID2` = c("a1", "a2", "c1"),
`_ID3` = c("a1", "a2", "a3"),
check.names = FALSE )
_ID0 _ID1 _ID2 _ID3
1 z1 a1 a1 a1
2 z2 b1 a2 a2
3 z3 c1 c1 a3
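I want to generate the following data frame. It contains the names of the two tables above and every pairwise combination of the primary key field in the first table with a foreign key field in the other. For each pair, the column named intersects shows the number of values the two fields have in common.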
matches <- data.frame(
pk_table = "dfA",
pk=c("__ID", "__ID","__ID","__ID"),
fk_table= c("dfB", "dfB","dfB","dfB"),
fk=c("_ID0", "_ID1", "_ID2", "_ID3"),
intersects=c(0, 1,2,3),
check.names = FALSE )
pk_table pk fk_table fk intersects
1 dfA __ID dfB _ID0 0
2 dfA __ID dfB _ID1 1
3 dfA __ID dfB _ID2 2
4 dfA __ID dfB _ID3 3
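Here is an example of how a single value of the intersects column is computed. It returns 1 because exactly one value in the __ID column also appears in _ID1.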
length( intersect(dfA$`__ID`, dfB$`_ID1`) )
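How can I create the above without loops? I would like a solution that accepts the primary key field and any number of other data frames (dfB, dfC, etc.) as input. The function should then count all matches between the primary key field and every column of every data structure supplied. In total, my database has 700 columns across 15 tables. My primary key field lives in one table, and I want to count how many of its values appear in each column of all 15 tables (including the table it comes from). I cannot assume that the foreign key columns follow any particular naming convention, but the database holds less than 50 MB of data in total, so I don't expect performance problems.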
Answer 0 (score: 1)
This should do the trick:
library(dplyr)
library(tidyr)
options(stringsAsFactors = FALSE) # keep the ID columns as character so the join keys match
dfA <- data.frame(
`__ID` = c("a1","a2","a3"),
col=c(1,2,3),
check.names = FALSE )
dfB <- data.frame(
fk_table = c("dfB", "dfB","dfB"), #added a column with the table name
`_ID0` = c("z1", "z2", "z3"),
`_ID1` = c("a1", "b1", "c1"),
`_ID2` = c("a1", "a2", "c1"),
`_ID3` = c("a1", "a2", "a3"),
check.names = FALSE )
dfB%>%
# first we gather the dataframe to long, tidy format
gather(key = fk, value = value, `_ID0`:`_ID3`)%>%
# then we do a left join.
# this introduces NA's for values (e.g. c1) that are not in dfA
left_join(dfA, by = c("value" = "__ID"))%>%
# Now we group by fk name (e.g. _ID0)
group_by(fk_table, fk)%>%
# And we count how often the result is not NA
# an inner_join followed by counting the rows would be simpler
# but then you don't get zero values as in the example
summarise(intersects=sum(!is.na(col)))
This returns the following:
fk_table fk intersects
1 dfB _ID0 0
2 dfB _ID1 1
3 dfB _ID2 2
4 dfB _ID3 3
The only difference is that the final result lacks the pk and pk_table columns, but I don't think adding them would be difficult.
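For the general case in the question (many tables, no naming convention for foreign keys), the missing pk and pk_table columns can be filled in by a small base R helper. The following is only a sketch under stated assumptions: `count_key_matches` is a hypothetical name, not part of the answer above, and it assumes the tables are passed as a named list of data frames that each have at least one column. It applies the question's own `length(intersect(...))` to every column, so it counts distinct shared values, whereas the join above counts matching rows; the two can differ when a foreign key column contains duplicates.

```r
# Example data from the question
dfA <- data.frame(
  `__ID` = c("a1", "a2", "a3"),
  col = c(1, 2, 3),
  check.names = FALSE, stringsAsFactors = FALSE)

dfB <- data.frame(
  `_ID0` = c("z1", "z2", "z3"),
  `_ID1` = c("a1", "b1", "c1"),
  `_ID2` = c("a1", "a2", "c1"),
  `_ID3` = c("a1", "a2", "a3"),
  check.names = FALSE, stringsAsFactors = FALSE)

# Hypothetical helper (an assumption, not from the original answer):
# for every column of every supplied table, count how many distinct
# primary key values that column contains.
count_key_matches <- function(pk_table, pk, tables) {
  # Distinct primary key values from the named table and column
  pk_values <- unique(tables[[pk_table]][[pk]])

  # Build one result data frame per table, with one row per column
  per_table <- lapply(names(tables), function(tbl) {
    df <- tables[[tbl]]
    data.frame(
      pk_table = pk_table,
      pk = pk,
      fk_table = tbl,
      fk = names(df),
      intersects = vapply(
        df,
        function(column) length(intersect(pk_values, column)),
        integer(1),
        USE.NAMES = FALSE),
      row.names = NULL,
      stringsAsFactors = FALSE)
  })

  # Stack the per-table results into a single data frame
  do.call(rbind, per_table)
}

# The primary key table itself is included in the list, as the question asks
matches <- count_key_matches("dfA", "__ID", list(dfA = dfA, dfB = dfB))
print(matches)
```

For the example data, the dfB rows reproduce the intersects values 0, 1, 2 and 3 from the question, and the two dfA rows show the key column matching itself. The iteration is hidden inside `lapply` and `vapply` rather than avoided, which should be acceptable for a database of about 700 columns and under 50 MB.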