Find the most recent entry if no entry exists in a diff-based database

Time: 2014-08-21 11:43:15

Tags: sql sqlite group-by where-clause

I have a database consisting of the following tables:

function
id    |    name

version
id    |    name

data_table
id | function_id | version_id | date | arbitrary_data1

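I insert data by parsing files. If a function has changed in a new version, I store the diff for that function. If it has not changed, no data is inserted, even though the function does exist in the new version. So, technically, I only store new data when a function in the file has been updated.

Now I need a somewhat complex query that makes the diff-based database look like an ordinary database, in which every version contains the complete data of the previous versions.

Example of what data_table might contain: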

data_table
id | function_id | version_id | date          | arbitrary_data
1    1             1            2012-01-01      0
2    2             1            2012-01-01      150
3    1             2            2012-01-02      100

I need a query that gives me arbitrary_data and date for every version of a particular function.

Expected result for function_id = 1:

date          | arbitrary_data
2012-01-01      0                           <-- version 1
2012-01-02      100                         <-- version 2

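The problem I am having comes from the rows that are missing when a file was not updated in "this" version. For example, if I extract the data for function #2, nothing is returned for the second version, because no row was inserted into the database for it.

The challenge now is to produce the complete data for every version (the data of every file).

The query needs to: select arbitrary_data and date for each version; if there is no entry for a particular version, find the most recent previous file entry and select arbitrary_data from that row. (The date should still be selected from the original row/version.)

It must be compatible with SQLite, and preferably fast.

I have a set of queries combined with some logic/scripting in Python that does this, but it takes about 1 second per version, which is too slow for what I need. Here is the Python code: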

from collections import Counter  # needed for the Counter arithmetic below

def get_data(self, function_id):

    #fvs is short for fileversions!

    #Gets the function ID and for each version ID
    all_fvs = self._conn.execute('''SELECT * FROM
                                    (SELECT id as function_id FROM function WHERE id = ?)
                                    CROSS JOIN
                                    (SELECT id as version_id from version)
                                    ''', [function_id]).fetchall()

    #Gets the function ID for each version ID that has been registered to the data_table
    registered_fvs = self._conn.execute('''SELECT function_id, version_id
                                            FROM data_table
                                            WHERE function_id = ?
                                            LIMIT 1
                                            ''', [function_id]).fetchall()

    #Gets the function ID for each version ID that has been registered to the data_table with incomplete arbitrary_data
    incomplete_registered_fvs = self._conn.execute('''SELECT arbitrary_data, version_id
                                                    FROM data_table
                                                    WHERE (arbitrary_data IS NULL OR date IS NULL)
                                                    GROUP BY version_id''').fetchall()

    #Gets the arbitrary_data we want for all the rows corresponding to registered_fvs
    data_set = self._conn.execute('''SELECT arbitrary_data, date from data_table
                                    WHERE function_id = ?
                                    ''', [function_id]).fetchall()

    #Converts the lists to counters so that we can perform set operations on them
    all_fvs_counter = Counter(all_fvs)
    registered_fvs_counter = Counter(registered_fvs)
    incomplete_registered_fvs_counter = Counter(incomplete_registered_fvs)

    #Filter out the registered fvs from all fvs
    non_registered_fvs = (all_fvs_counter-registered_fvs_counter)-incomplete_registered_fvs_counter

    #For all the versions that aren't registered, we fetch the latest value of a previous version which was registered
    for (function, version) in non_registered_fvs:
        data_set.append(self._conn.execute('''SELECT arbitrary_data, date
                                                FROM data_table
                                                WHERE function_id = ?
                                                AND date <= (SELECT date FROM data_table WHERE version_id = ? LIMIT 1)
                                                ORDER BY date DESC
                                                LIMIT 1
                                                ''', [function, version]).fetchone())

    return data_set

4 answers:

Answer 0 (score: 2)

Generate all possible combinations of files and versions, then join them to the table to get the data you want:

select f.file_id, v.version_id, fv.data
from (select distinct file_id from fileversions) f cross join
     (select distinct version_id from fileversions) v left join
     fileversions fv
     on fv.file_id = f.file_id and
        fv.version_id = v.version_id;

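Files that have no row for a particular version will show NULL for data.

Filling in the data from a previous version is a bit harder. You can do it like this: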
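To see the gaps this join exposes, here is a minimal sketch using Python's sqlite3 module. The fileversions table layout and the three sample rows are assumptions made for illustration (the answer does not define them), and the ORDER BY is added only to make the output deterministic.

```python
import sqlite3

# Hypothetical in-memory table: file 2 has no row for version 2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fileversions (file_id INTEGER, version_id INTEGER, data TEXT);
    INSERT INTO fileversions VALUES (1, 1, 'a1');
    INSERT INTO fileversions VALUES (2, 1, 'b1');
    INSERT INTO fileversions VALUES (1, 2, 'a2');
""")

# Every (file, version) pair, left-joined to the rows that actually exist.
rows = conn.execute("""
    SELECT f.file_id, v.version_id, fv.data
    FROM (SELECT DISTINCT file_id FROM fileversions) f
    CROSS JOIN (SELECT DISTINCT version_id FROM fileversions) v
    LEFT JOIN fileversions fv
           ON fv.file_id = f.file_id AND fv.version_id = v.version_id
    ORDER BY f.file_id, v.version_id
""").fetchall()

for row in rows:
    print(row)  # the pair (2, 2) comes back with data = None
```

The missing (file 2, version 2) combination appears with None, which is exactly the hole that a fill-forward step has to close.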

select f.file_id, v.version_id,
       (select fv.data
        from fileversions fv
        where fv.file_id = f.file_id and
              fv.version_id <= v.version_id
        order by fv.version_id desc
        limit 1
       ) as data
from (select distinct file_id from fileversions) f cross join
     (select distinct version_id from fileversions) v
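The fill-forward variant can be tried the same way. This is a sketch under assumptions: the fileversions columns and sample rows are hypothetical, the version is taken from the v subquery, and an ORDER BY is added for stable output.

```python
import sqlite3

# Hypothetical sample data: file 2 was only stored for version 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fileversions (file_id INTEGER, version_id INTEGER, data TEXT);
    INSERT INTO fileversions VALUES (1, 1, 'a1');
    INSERT INTO fileversions VALUES (2, 1, 'b1');
    INSERT INTO fileversions VALUES (1, 2, 'a2');
""")

# For each (file, version) pair, a correlated subquery picks the row with
# the highest version_id that does not exceed the requested version.
filled = conn.execute("""
    SELECT f.file_id, v.version_id,
           (SELECT fv.data
            FROM fileversions fv
            WHERE fv.file_id = f.file_id
              AND fv.version_id <= v.version_id
            ORDER BY fv.version_id DESC
            LIMIT 1) AS data
    FROM (SELECT DISTINCT file_id FROM fileversions) f
    CROSS JOIN (SELECT DISTINCT version_id FROM fileversions) v
    ORDER BY f.file_id, v.version_id
""").fetchall()

for row in filled:
    print(row)  # (2, 2, 'b1') is carried forward from version 1
```

Note that this relies on version_id values increasing with time; if version ids are not ordered chronologically, the comparison would need a different ordering column.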

Answer 1 (score: 1)

First, we need all combinations of files and versions:

SELECT file.id,
       version.id
FROM file
CROSS JOIN version

The matching data is the row for the same file with the largest version that is not larger than the desired version:

SELECT file.id,
       version.id,
       (SELECT id
        FROM data
        WHERE file_id = file.id         -- same file
          AND version_id <= version.id  -- same or earlier version
        ORDER BY version_id DESC        -- largest version first
        LIMIT 1
       ) AS data_id
FROM file
CROSS JOIN version

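Multiple columns can be looked up from the data table, but that would require multiple subqueries. It is better to join the data table using the data.id we have already looked up: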

SELECT IDs.version_id,
       data.*
FROM (SELECT version.id AS version_id,
             (SELECT id
              FROM data
              WHERE file_id = file.id
                AND version_id <= version.id
              ORDER BY version_id DESC
              LIMIT 1
             ) AS data_id
      FROM file
      CROSS JOIN version) AS IDs
JOIN data ON IDs.data_id = data.id
ORDER BY IDs.version_id,
         data.file_id
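A small end-to-end run of this last query, as a sketch: the payload column, the sample rows, and the index on (file_id, version_id) are assumptions added here for illustration. The index is not part of the answer; it is included because it is the kind of index the correlated subquery could use to avoid scanning data for every pair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file    (id INTEGER PRIMARY KEY);
    CREATE TABLE version (id INTEGER PRIMARY KEY);
    CREATE TABLE data    (id INTEGER PRIMARY KEY, file_id INTEGER,
                          version_id INTEGER, payload TEXT);
    -- assumed index, not from the answer
    CREATE INDEX data_file_version ON data (file_id, version_id);

    INSERT INTO file VALUES (1);
    INSERT INTO file VALUES (2);
    INSERT INTO version VALUES (1);
    INSERT INTO version VALUES (2);
    INSERT INTO data VALUES (1, 1, 1, 'f1 v1');
    INSERT INTO data VALUES (2, 2, 1, 'f2 v1');
    INSERT INTO data VALUES (3, 1, 2, 'f1 v2');
""")

# Look up one data.id per (file, version) pair, then join once to fetch
# as many columns as needed from the matching data row.
rows = conn.execute("""
    SELECT IDs.version_id, data.file_id, data.payload
    FROM (SELECT version.id AS version_id,
                 (SELECT id
                  FROM data
                  WHERE file_id = file.id
                    AND version_id <= version.id
                  ORDER BY version_id DESC
                  LIMIT 1) AS data_id
          FROM file
          CROSS JOIN version) AS IDs
    JOIN data ON IDs.data_id = data.id
    ORDER BY IDs.version_id, data.file_id
""").fetchall()

for row in rows:
    print(row)  # file 2 reuses its version 1 row for version 2
```

Because the outer join is an inner JOIN, a file with no row at or before a given version simply produces no output row for that pair.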

Answer 2 (score: 1)

select max(maxDate.date), dt.arbitrary_data, f.id, v.id
from function f
cross join version v
inner join data_table dt
on dt.function_id = f.id
and dt.version_id <= v.id
left join data_table noIntermediate
on dt.function_id = noIntermediate.function_id
and noIntermediate.version_id <= v.id
and noIntermediate.version_id > dt.version_id
inner join data_table maxDate
on maxDate.version_id = v.id
where noIntermediate.function_id is null
group by f.id, v.id, dt.arbitrary_data

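This query is based on a cross join between all functions and all versions. For each function, it picks from data_table the highest version of that function that is not greater than the version currently being iterated. This is done with an anti-join, which checks that no version exists between the two. To get the date of the version rather than the date of the arbitrary_data row, I had to add another join and a group by. In the select, max() could just as well be min() or avg(): all rows in a group should have the same date, so any of them will do.

Answer 3 (score: 0)

Take a look at this. I heavily simplified/modified the table definitions, and this was done on PostgreSQL, so the actual SQLite code will differ. But the core idea should work; I have done this elsewhere.

The clever bit is in the view. The subselect criterion on versions.vname is the part that needs to be reproduced in SQLite.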
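As a check, the query can be run against the sample rows from the question. This is a sketch with assumptions: the column types and the function/version names are invented, and the `f.id = ?` filter and the `order by v.id` are additions that restrict the output to one function in version order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE function (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE version  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE data_table (id INTEGER PRIMARY KEY, function_id INTEGER,
                             version_id INTEGER, date TEXT,
                             arbitrary_data INTEGER);
    INSERT INTO function VALUES (1, 'f1');
    INSERT INTO function VALUES (2, 'f2');
    INSERT INTO version VALUES (1, 'v1');
    INSERT INTO version VALUES (2, 'v2');
    INSERT INTO data_table VALUES (1, 1, 1, '2012-01-01', 0);
    INSERT INTO data_table VALUES (2, 2, 1, '2012-01-01', 150);
    INSERT INTO data_table VALUES (3, 1, 2, '2012-01-02', 100);
""")

# Same joins as the answer; only the function filter and ordering are new.
rows = conn.execute("""
    select max(maxDate.date), dt.arbitrary_data, f.id, v.id
    from function f
    cross join version v
    inner join data_table dt
            on dt.function_id = f.id
           and dt.version_id <= v.id
    left join data_table noIntermediate
           on dt.function_id = noIntermediate.function_id
          and noIntermediate.version_id <= v.id
          and noIntermediate.version_id > dt.version_id
    inner join data_table maxDate
            on maxDate.version_id = v.id
    where noIntermediate.function_id is null
      and f.id = ?
    group by f.id, v.id, dt.arbitrary_data
    order by v.id
""", [2]).fetchall()

for row in rows:
    print(row)  # function 2 keeps 150 for version 2, dated 2012-01-02
```

One consequence of the inner join on maxDate is worth keeping in mind: a version for which no function at all has a row in data_table has no date to borrow, so it drops out of the result.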

drop view if exists func_data_calc;
drop table if exists versions;

/* and you would probably store things like the dates for the version on this table */
create table versions (vname varchar(20));

drop table if exists func_data;

/* very basic table. stores a function name, the version it refers to and the arb data*/
create table func_data (funcname varchar(20), vname varchar(20), arb_data varchar(20));

/* the trick here is that we pull up the latest arbitrary data 
for a function, for the most recent 
version that is not greater than the version
coming from the version table*/

create view func_data_calc (funcname, vname, arb_data) 
as select funcname, versions.vname, arb_data 
from func_data, versions
where func_data.vname = 
(select max(vname) 
 from func_data s 
 where func_data.funcname = s.funcname 
 and s.vname <= versions.vname);

/* test it */
insert into func_data (funcname, vname, arb_data) 
values ('func1', 'ver01', 'data func1 ver01');

/* oh, no, func1 has nothing for ver02, wasn't touched*/

insert into func_data (funcname, vname, arb_data) 
values ('func2', 'ver01', 'data func2 ver01');

insert into func_data (funcname, vname, arb_data) 
values ('func2', 'ver02', 'data func2 ver02');

/* insert some dummy version data from the function_data*/
insert into versions (vname) select distinct vname from func_data;


select * from func_data_calc
order by funcname, vname;

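The result of the select? Notice how func1 has ver02 data, even though it points to the ver01 diff. The func_data.vname = (select max(vname) ... condition above is what plugs the hole.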

funcname|vname|arb_data
"func1";"ver01";"data func1 ver01"
"func1";"ver02";"data func1 ver01"
"func2";"ver01";"data func2 ver01"
"func2";"ver02";"data func2 ver02"
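For SQLite, the same view can be sketched as follows. This is an adaptation, not the answer's code: varchar columns are written as TEXT, the view's column list is replaced by column aliases (older SQLite releases do not accept a column list on CREATE VIEW), and table aliases fd and v are introduced for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE versions  (vname TEXT);
    CREATE TABLE func_data (funcname TEXT, vname TEXT, arb_data TEXT);

    -- For every (function row, version) pair, keep the function row whose
    -- vname is the greatest one not exceeding the version's vname.
    CREATE VIEW func_data_calc AS
    SELECT fd.funcname AS funcname, v.vname AS vname, fd.arb_data AS arb_data
    FROM func_data fd, versions v
    WHERE fd.vname = (SELECT max(s.vname)
                      FROM func_data s
                      WHERE s.funcname = fd.funcname
                        AND s.vname <= v.vname);

    INSERT INTO func_data VALUES ('func1', 'ver01', 'data func1 ver01');
    INSERT INTO func_data VALUES ('func2', 'ver01', 'data func2 ver01');
    INSERT INTO func_data VALUES ('func2', 'ver02', 'data func2 ver02');
    INSERT INTO versions (vname) SELECT DISTINCT vname FROM func_data;
""")

rows = conn.execute(
    "SELECT * FROM func_data_calc ORDER BY funcname, vname"
).fetchall()

for row in rows:
    print(row)
```

The comparison on vname is a plain string comparison, so it only orders versions correctly while the names sort lexicographically (ver01, ver02, ...); a numeric version id would be a safer key.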

One thing that still needs to be added: a closure mechanism to indicate that a function no longer exists as of a certain version.