Question

I have a large number of table combinations that are used in data warehouse queries.

Example:

QueryID    Table
--------   ------
1          t1
1          t2
1          t3
--------------
2          t1
2          t3
--------------
3          t4
3          t2
--------------
4          t2
4          t3
4          t4
--------------
5          t3
5          t1

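...and many more (there are several thousand such combinations). My analysis so far involves finding patterns such as the most frequently used tables. My next step is to find the maximum number of queries that can be run with the minimum number of tables, taking table size into account.

For example, from the data above, each of the table combinations (t1, t2, t3), (t1, t3), and (t2, t3, t4) lets us run at least two queries. If the table sizes are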

table    size
-----    -----
t1        20 GB
t2        40 GB
t3        10 GB
t4        50 GB

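then

(t1, t2, t3), 70 GB in total, can run three queries
(t1, t3), 30 GB in total, can run two queries
(t2, t3, t4), 100 GB in total, can run two queries

Here (t1, t3) is the best combination: it has the smallest total size and the fewest tables while still running two queries. I am trying several approaches with SQL, Excel, and R to come up with a dynamic solution that takes parameters such as the number of queries you want to run and the maximum combined table size you are willing to tolerate. Any suggestions or best approaches would be much appreciated.

Update

A query needs all of its participating tables to be available before it can run. So we cannot say that t1 alone satisfies two queries, or that t2 alone satisfies three.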
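
The rule in the update can be sketched in base R as follows. This is only a minimal sketch: `runnable_queries` is a hypothetical helper name introduced for illustration, and the data are the five queries and four table sizes listed in the question.

```r
# Data from the question, including query 5.
quer <- data.frame(
  QueryID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 5L, 5L),
  Table   = c("t1", "t2", "t3", "t1", "t3", "t4", "t2",
              "t2", "t3", "t4", "t3", "t1"),
  stringsAsFactors = FALSE
)
size <- c(t1 = 20, t2 = 40, t3 = 10, t4 = 50)  # sizes in GB

# Tables required by each query, as a list named by QueryID.
needs <- split(quer$Table, quer$QueryID)

# Hypothetical helper: IDs of the queries whose required tables are
# ALL contained in `tables` (a query cannot run on a partial set).
runnable_queries <- function(tables, needs) {
  ok <- vapply(needs, function(req) all(req %in% tables), logical(1))
  names(needs)[ok]
}

# The three combinations discussed in the question.
combos <- list(c("t1", "t2", "t3"), c("t1", "t3"), c("t2", "t3", "t4"))
summary_df <- data.frame(
  tables  = vapply(combos, paste, character(1), collapse = ","),
  size_gb = vapply(combos, function(tb) sum(size[tb]), numeric(1)),
  queries = vapply(combos,
                   function(tb) length(runnable_queries(tb, needs)),
                   integer(1))
)
print(summary_df)
```

With this rule, (t1, t2, t3) covers three queries while (t1, t3) and (t2, t3, t4) cover two each, which matches the counts given in the question.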

Answer 1

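It is not clear to me how you count the queries for a set of tables. I can see that t1 appears in two queries, t2 in three, and t3 in three. How do you count the number of queries for the combination of tables t1, t2, and t3?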

# the data that you posted
quer <- structure(list(QueryID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L), 
    Table = c("t1", "t2", "t3", "t1", "t3", "t4", "t2", "t2", "t3", "t4")), 
    .Names = c("QueryID", "Table"), class = "data.frame", row.names = c(NA, -10L))
size <- structure(list(Table = c("t1", "t2", "t3", "t4"), 
    GB = c(20L, 40L, 10L, 50L)), .Names = c("Table", "GB"), 
    class = "data.frame", row.names = c(NA, -4L))

# perhaps a helpful way to reorganize your query data
both <- merge(quer, size)
size2 <- tapply(both$GB, list(both$QueryID, both$Table), mean)
size2
#   t1 t2 t3 t4
# 1 20 40 10 NA
# 2 20 NA 10 NA
# 3 NA 40 NA 50
# 4 NA 40 10 50

# total size (GB) of the tables each query needs
apply(size2, 1, sum, na.rm=TRUE)
#   1   2   3   4 
#  70  30  90 100

Answer 2

If I understand correctly, you want the minimum total size needed to run x queries.

library(data.table)

#datasets borrowed from @androboy s answer (removed special character for code formatting to work)
quer <- structure(list(QueryID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L), 
                       Table = c("t1", "t2", "t3", "t1", "t3", "t4", "t2", "t2", "t3", "t4")), 
                  .Names = c("QueryID", "Table"), class = "data.frame", row.names = c(NA, -10L))
size <- structure(list(Table = c("t1", "t2", "t3", "t4"), 
                       GB = c(20L, 40L, 10L, 50L)), .Names = c("Table", "GB"), 
                  class = "data.frame", row.names = c(NA, -4L))

quer <- data.table(quer)
size <- data.table(size)
# number of queries to run
queries <- 2

# creating unique combinations of two queries each-------------------
querylist <- vector(mode="list",length=queries)
for(i in seq(queries))
{
  querylist[[i]]<-unique(quer$QueryID)
}
qdf <- (expand.grid(querylist))
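# removing rows in which the same query is counted more than once:
# a row is kept only if every pair of its columns differs
if ( queries > 1)
{
  test <- apply(
    t(combn(seq(queries),2)),
    1,
    function(x)
    {
      (qdf[,x[1]] != qdf[,x[2]])
    }
  )

  # all choose(queries, 2) column pairs must differ
  qdf <- qdf[rowSums(test) == choose(queries, 2),]
}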

qdf <- data.table(qdf)
# checking tablesizes needed to run this combination-------------------
qdf[,tablesneeded := '']
qdf[,sizeneeded := as.integer(NA)]

setkeyv(quer,'QueryID')
setkeyv(size,'Table')

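for( i in seq(nrow(qdf)))
{
  # query IDs selected in row i (the Var1, Var2, ... columns from expand.grid)
  ids <- unlist(qdf[i,grep(colnames(qdf), pattern = "Var", value = TRUE), with = FALSE])
  # keyed join on QueryID, then the distinct tables those queries need
  Tables <- quer[data.table(V1 = ids)][,unique(Table)]

  qdf[i, tablesneeded := paste(Tables,collapse = ',')]

  # keyed join on Table, then the total size in GB
  qdf[i, sizeneeded := as.integer(sum(size[data.table(V1 = Tables)][,GB], na.rm = TRUE))]

}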

# lowest size option for current number of queries-----------------------
qdf[which.min(sizeneeded)]
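
A more compact variant of the same idea can be sketched in base R. This is an assumption-laden sketch rather than the answerer's code: `min_size_for_k` is a hypothetical helper name, and it uses `combn` so that each unordered set of queries is examined once instead of once per ordering.

```r
# Same four-query data as in the answer above.
quer <- data.frame(
  QueryID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
  Table   = c("t1", "t2", "t3", "t1", "t3", "t4", "t2", "t2", "t3", "t4"),
  stringsAsFactors = FALSE
)
size <- c(t1 = 20, t2 = 40, t3 = 10, t4 = 50)  # sizes in GB

# Hypothetical helper: among all sets of k distinct queries, return the one
# whose union of required tables has the smallest total size.
min_size_for_k <- function(quer, size, k) {
  needs <- split(quer$Table, quer$QueryID)
  picks <- combn(names(needs), k, simplify = FALSE)  # unordered, no repeats
  res <- lapply(picks, function(ids) {
    tabs <- sort(unique(unlist(needs[ids], use.names = FALSE)))
    data.frame(queries = paste(ids, collapse = ","),
               tables  = paste(tabs, collapse = ","),
               size_gb = sum(size[tabs]),
               stringsAsFactors = FALSE)
  })
  res <- do.call(rbind, res)
  res[which.min(res$size_gb), ]
}

best <- min_size_for_k(quer, size, k = 2)
print(best)
```

The number of candidate sets is `choose(n, k)` for n queries, so with thousands of queries this enumeration is only practical for small k.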

Answer 3

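You could brute-force all table combinations that fit within the allowed size and take the one that covers the most queries; enumerating all subsets of N tables takes O(2^N * N) time. If brute force is not feasible, then I am afraid your problem is at least as hard as the knapsack problem, for which no polynomial-time algorithm is known. In that case, a greedy knapsack-style heuristic or a genetic algorithm can give you a good feasible solution.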
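
The brute-force option can be sketched in base R as below. It is a minimal sketch under stated assumptions: `best_table_set` and its `max_gb` parameter are hypothetical names, the data are the five queries from the question, and ties in query count are broken by smaller total size.

```r
# Data from the question, including query 5.
quer <- data.frame(
  QueryID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 5L, 5L),
  Table   = c("t1", "t2", "t3", "t1", "t3", "t4", "t2",
              "t2", "t3", "t4", "t3", "t1"),
  stringsAsFactors = FALSE
)
size  <- c(t1 = 20, t2 = 40, t3 = 10, t4 = 50)  # sizes in GB
needs <- split(quer$Table, quer$QueryID)        # tables required per query

# Hypothetical helper: try every non-empty subset of tables whose total
# size is at most max_gb and keep the one that runs the most queries.
best_table_set <- function(needs, size, max_gb) {
  tabs <- names(size)
  # All non-empty subsets, grouped by subset size 1..N.
  subsets <- unlist(
    lapply(seq_along(tabs), function(k) combn(tabs, k, simplify = FALSE)),
    recursive = FALSE
  )
  best <- list(tables = character(0), size_gb = 0, queries = 0L)
  for (chosen in subsets) {
    total <- sum(size[chosen])
    if (total > max_gb) next  # over the size budget
    # A query counts only if all of its tables are in the subset.
    covered <- sum(vapply(needs, function(req) all(req %in% chosen),
                          logical(1)))
    better <- covered > best$queries ||
      (covered > 0 && covered == best$queries && total < best$size_gb)
    if (better) best <- list(tables = chosen, size_gb = total, queries = covered)
  }
  best
}

res <- best_table_set(needs, size, max_gb = 60)
print(res)
```

With a 60 GB budget this selects (t1, t3) at 30 GB, covering two queries. Because the loop visits 2^N - 1 subsets, it is realistic only for a few dozen tables at most; beyond that, the heuristics mentioned above are the practical route.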
