I have a database made up of the following tables:
function
id | name
version
id | name
data_table
id | function_id | version_id | date | arbitrary_data
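I insert data by parsing files. If a function has changed in a new version, I store the diff for that function. If it has not changed, nothing is inserted, even though the function still exists in the new version. So, technically, I only store new data when a function in the file has been updated.
Now I need a somewhat complex query that makes this diff-based database look like a normal one, in which every version contains the complete data carried over from previous versions.
Example of what data_table might contain: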
data_table
id | function_id | version_id | date       | arbitrary_data
1  | 1           | 1          | 2012-01-01 | 0
2  | 2           | 1          | 2012-01-01 | 150
3  | 1           | 2          | 2012-01-02 | 100
I need a query that gives me the arbitrary_data and date for every version of a particular function.
Example of the expected result for function_id = 1:
date       | arbitrary_data
2012-01-01 | 0               <-- version 1
2012-01-02 | 100             <-- version 2
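The problem I have comes from the rows that are missing whenever a file was not updated in "this" version. For example, if I extract the data for function #2, nothing is returned for the second version, because no row for it was ever inserted into the database.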
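To make the gap concrete, here is a minimal sketch that loads the sample rows into an in-memory SQLite database and reads back what is actually stored per function. The column types, the placeholder names 'f1', 'f2', 'v1', 'v2' and the helper `stored_rows` are assumptions made for illustration; the question only gives the column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE function (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE version  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE data_table (
        id             INTEGER PRIMARY KEY,
        function_id    INTEGER REFERENCES function (id),
        version_id     INTEGER REFERENCES version (id),
        date           TEXT,
        arbitrary_data INTEGER
    );
    -- Placeholder names; the question does not show them
    INSERT INTO function VALUES (1, 'f1'), (2, 'f2');
    INSERT INTO version  VALUES (1, 'v1'), (2, 'v2');
    -- The three sample rows from the question
    INSERT INTO data_table VALUES
        (1, 1, 1, '2012-01-01', 0),
        (2, 2, 1, '2012-01-01', 150),
        (3, 1, 2, '2012-01-02', 100);
""")


def stored_rows(function_id):
    """Return only the rows physically stored for one function."""
    return conn.execute(
        "SELECT version_id, date, arbitrary_data FROM data_table "
        "WHERE function_id = ? ORDER BY version_id",
        (function_id,),
    ).fetchall()


print(stored_rows(1))  # one row per version
print(stored_rows(2))  # only version 1; version 2 is the missing row
```

Function 1 changed in both versions, so it has two rows. Function 2 has a single row, and the row for version 2 is the one the query has to synthesize.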
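The challenge now is to produce the complete data for every version (the data for each file).
The query needs to select arbitrary_data and date for each version. If there is no entry for a particular version, it should find the most recent previous entry for that file and select arbitrary_data from that row. (The date should still be selected from the original row/version.)
It has to be compatible with SQLite and should preferably be fast.
I have a set of queries combined with some Python logic/scripting that does this, but it takes about 1 second per version, which is too slow for what I need. Here is the Python code: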
    from collections import Counter

    def get_data(self, function_id):
        # fvs is short for fileversions!
        # Every (function_id, version_id) pair for this function
        all_fvs = self._conn.execute('''SELECT * FROM
                                        (SELECT id AS function_id FROM function WHERE id = ?)
                                        CROSS JOIN
                                        (SELECT id AS version_id FROM version)
                                     ''', [function_id]).fetchall()
        # The (function_id, version_id) pairs that have been registered in data_table
        registered_fvs = self._conn.execute('''SELECT function_id, version_id
                                               FROM data_table
                                               WHERE function_id = ?
                                               LIMIT 1
                                            ''', [function_id]).fetchall()
        # Rows registered in data_table with incomplete arbitrary_data or date
        incomplete_registered_fvs = self._conn.execute('''SELECT arbitrary_data, version_id
                                                          FROM data_table
                                                          WHERE (arbitrary_data IS NULL OR date IS NULL)
                                                          GROUP BY version_id''').fetchall()
        # The arbitrary_data we want for all the rows corresponding to registered_fvs
        data_set = self._conn.execute('''SELECT arbitrary_data, date FROM data_table
                                         WHERE function_id = ?
                                      ''', [function_id]).fetchall()
        # Convert the lists to Counters so that we can perform set operations on them
        all_fvs_counter = Counter(all_fvs)
        registered_fvs_counter = Counter(registered_fvs)
        incomplete_registered_fvs_counter = Counter(incomplete_registered_fvs)
        # Filter the registered fvs out of all fvs
        non_registered_fvs = (all_fvs_counter - registered_fvs_counter) - incomplete_registered_fvs_counter
        # For every version that isn't registered, fetch the latest value
        # of a previous version that was registered
        for (function, version) in non_registered_fvs:
            data_set.append(self._conn.execute('''SELECT arbitrary_data, date
                                                  FROM data_table
                                                  WHERE function_id = ?
                                                  AND date <= (SELECT date FROM data_table WHERE version_id = ? LIMIT 1)
                                                  ORDER BY date DESC
                                                  LIMIT 1
                                               ''', [function, version]).fetchone())
        return data_set
Answer 0 (score: 2)
Generate all possible combinations of files and versions, then join to the table to get the data you want:
    select f.file_id, v.version_id, fv.data
    from (select distinct file_id from fileversions) f cross join
         (select distinct version_id from fileversions) v left join
         fileversions fv
         on fv.file_id = f.file_id and
            fv.version_id = v.version_id;
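Files that have no row for a particular version will show NULL for data.
Filling in the data from the previous version is a bit harder. You could do it like this: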
    select f.file_id, v.version_id,
           (select fv.data
            from fileversions fv
            where fv.file_id = f.file_id and
                  fv.version_id <= v.version_id
            order by fv.version_id desc
            limit 1
           ) as data
    from (select distinct file_id from fileversions) f cross join
         (select distinct version_id from fileversions) v
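As a sketch of how this fill-forward subquery might look against the question's own tables, the following runs it in SQLite for a single function. It assumes version ids increase in chronological order. The index `idx_data_function_version`, the constant `FILL_FORWARD` and the helper `filled_data` are hypothetical names that appear nowhere in the original posts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE version (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE data_table (
        id INTEGER PRIMARY KEY, function_id INTEGER, version_id INTEGER,
        date TEXT, arbitrary_data INTEGER
    );
    -- Hypothetical index so each per-version lookup can use an index search
    CREATE INDEX idx_data_function_version
        ON data_table (function_id, version_id);
    INSERT INTO version VALUES (1, 'v1'), (2, 'v2');
    INSERT INTO data_table VALUES
        (1, 1, 1, '2012-01-01', 0),
        (2, 2, 1, '2012-01-01', 150),
        (3, 1, 2, '2012-01-02', 100);
""")

# For every version, take the newest stored row at or before that version.
FILL_FORWARD = """
    SELECT v.id AS version_id,
           (SELECT d.arbitrary_data
              FROM data_table AS d
             WHERE d.function_id = :function_id
               AND d.version_id <= v.id
             ORDER BY d.version_id DESC
             LIMIT 1) AS arbitrary_data
      FROM version AS v
     ORDER BY v.id
"""


def filled_data(function_id):
    """Return (version_id, arbitrary_data) for every version, gaps filled."""
    return conn.execute(FILL_FORWARD, {"function_id": function_id}).fetchall()


print(filled_data(2))  # version 2 reuses the value stored for version 1
```

A version that precedes the first stored row of a function would come back with NULL here, because the subquery finds nothing to carry forward.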
Answer 1 (score: 1)
First, we need all combinations of files and versions:
    SELECT file.id,
           version.id
    FROM file
    CROSS JOIN version
The matching data is the row for the same file whose version is the largest one not greater than the requested version:
    SELECT file.id,
           version.id,
           (SELECT id
            FROM data
            WHERE file_id = file.id            -- same file
              AND version_id <= version.id     -- same or earlier version
            ORDER BY version_id DESC           -- largest version first
            LIMIT 1
           ) AS data_id
    FROM file
    CROSS JOIN version
Several columns can be looked up from the data table this way, but that takes several subqueries.
It is better to join the data table using the data.id we have already looked up:
    SELECT IDs.version_id,
           data.*
    FROM (SELECT version.id AS version_id,
                 (SELECT id
                  FROM data
                  WHERE file_id = file.id
                    AND version_id <= version.id
                  ORDER BY version_id DESC
                  LIMIT 1
                 ) AS data_id
          FROM file
          CROSS JOIN version) AS IDs
    JOIN data ON IDs.data_id = data.id
    ORDER BY IDs.version_id,
             data.file_id
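The same idea can be tried against the question's tables. The sketch below maps `file` to `function` and `data` to `data_table`, which is an assumption about how the answer's names correspond to the question's schema. `LOOKUP_THEN_JOIN` and `all_rows` are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE function (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE version  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE data_table (
        id INTEGER PRIMARY KEY, function_id INTEGER, version_id INTEGER,
        date TEXT, arbitrary_data INTEGER
    );
    INSERT INTO function VALUES (1, 'f1'), (2, 'f2');
    INSERT INTO version  VALUES (1, 'v1'), (2, 'v2');
    INSERT INTO data_table VALUES
        (1, 1, 1, '2012-01-01', 0),
        (2, 2, 1, '2012-01-01', 150),
        (3, 1, 2, '2012-01-02', 100);
""")

# Step 1 (inner query): find the id of the newest row at or before each version.
# Step 2 (outer join): fetch every column we need from that single row.
LOOKUP_THEN_JOIN = """
    SELECT ids.function_id, ids.version_id, d.date, d.arbitrary_data
      FROM (SELECT f.id AS function_id,
                   v.id AS version_id,
                   (SELECT d2.id
                      FROM data_table AS d2
                     WHERE d2.function_id = f.id
                       AND d2.version_id <= v.id
                     ORDER BY d2.version_id DESC
                     LIMIT 1) AS data_id
              FROM function AS f
             CROSS JOIN version AS v) AS ids
      JOIN data_table AS d ON d.id = ids.data_id
     ORDER BY ids.function_id, ids.version_id
"""

all_rows = conn.execute(LOOKUP_THEN_JOIN).fetchall()
for row in all_rows:
    print(row)
```

Note that the date returned for function 2 in version 2 is the date of the row the data came from (2012-01-01), not the date of version 2 itself. If the version's own date is needed, it has to be looked up separately.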
Answer 2 (score: 1)
    select max(maxDate.date), dt.arbitrary_data, f.id, v.id
    from function f
    cross join version v
    inner join data_table dt
        on dt.function_id = f.id
        and dt.version_id <= v.id
    left join data_table noIntermediate
        on dt.function_id = noIntermediate.function_id
        and noIntermediate.version_id <= v.id
        and noIntermediate.version_id > dt.version_id
    inner join data_table maxDate
        on maxDate.version_id = v.id
    where noIntermediate.function_id is null
    group by f.id, v.id, dt.arbitrary_data
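This query is based on a cross join between all functions and all versions. For each function, it takes the row from data_table with the largest version that is not greater than the version currently being iterated. This is done with an anti-join, which checks that no version exists between the two. To get the date of the version rather than the date of the arbitrary-data row, I had to add another join and a GROUP BY. In the SELECT, min() could be used instead of max() (or avg(), where dates are stored as numbers): all rows in a group should have the same date, so any of them will do.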
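To check the anti-join against the question's sample rows, the query can be run in an in-memory SQLite database. This is only a sketch: the table definitions are assumed from the column list in the question, an ORDER BY is added so the output order is deterministic, and `ANTI_JOIN` and `rows` are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE function (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE version  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE data_table (
        id INTEGER PRIMARY KEY, function_id INTEGER, version_id INTEGER,
        date TEXT, arbitrary_data INTEGER
    );
    INSERT INTO function VALUES (1, 'f1'), (2, 'f2');
    INSERT INTO version  VALUES (1, 'v1'), (2, 'v2');
    INSERT INTO data_table VALUES
        (1, 1, 1, '2012-01-01', 0),
        (2, 2, 1, '2012-01-01', 150),
        (3, 1, 2, '2012-01-02', 100);
""")

# dt:             candidate rows at or before the current version
# noIntermediate: any newer row that is still at or before the current version;
#                 keeping only the rows where none exists is the anti-join
# maxDate:        any row of the current version, used only for its date
ANTI_JOIN = """
    SELECT max(maxDate.date), dt.arbitrary_data, f.id, v.id
      FROM function AS f
     CROSS JOIN version AS v
     INNER JOIN data_table AS dt
        ON dt.function_id = f.id
       AND dt.version_id <= v.id
      LEFT JOIN data_table AS noIntermediate
        ON dt.function_id = noIntermediate.function_id
       AND noIntermediate.version_id <= v.id
       AND noIntermediate.version_id > dt.version_id
     INNER JOIN data_table AS maxDate
        ON maxDate.version_id = v.id
     WHERE noIntermediate.function_id IS NULL
     GROUP BY f.id, v.id, dt.arbitrary_data
     ORDER BY f.id, v.id
"""

rows = conn.execute(ANTI_JOIN).fetchall()
for row in rows:
    print(row)
```

Here function 2 in version 2 keeps the value 150 but picks up the version's own date, 2012-01-02. One consequence of the inner join on maxDate is that a version with no rows at all in data_table would drop out of the result, since there would be nothing to take its date from.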
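Answer 3 (score: 0)
Take a look at this. I have heavily simplified/modified the table definitions, and this was done on PostgreSQL, so the actual SQLite code will differ. But the core idea should work; I have done this elsewhere.
The clever bit is in the view: the subselect criterion on versions.vname is what the SQLite version needs to do.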
    drop view if exists func_data_calc;
    drop table if exists versions;
    /* and you would probably store things like the dates for the version on this table */
    create table versions (vname varchar(20));
    drop table if exists func_data;
    /* very basic table. stores a function name, the version it refers to and the arb data */
    create table func_data (funcname varchar(20), vname varchar(20), arb_data varchar(20));
    /* the trick here is that we pull up the latest arbitrary data
       for a function, for the most recent
       version that is not greater than the version
       coming from the version table */
    create view func_data_calc (funcname, vname, arb_data)
    as select funcname, versions.vname, arb_data
       from func_data, versions
       where func_data.vname =
             (select max(vname)
              from func_data s
              where func_data.funcname = s.funcname
                and s.vname <= versions.vname);
    /* test it */
    insert into func_data (funcname, vname, arb_data)
    values ('func1', 'ver01', 'data func1 ver01');
    /* oh, no, func1 has nothing for ver02, wasn't touched */
    insert into func_data (funcname, vname, arb_data)
    values ('func2', 'ver01', 'data func2 ver01');
    insert into func_data (funcname, vname, arb_data)
    values ('func2', 'ver02', 'data func2 ver02');
    /* insert some dummy version data from the function_data */
    insert into versions (vname) select distinct vname from func_data;
    select * from func_data_calc
    order by funcname, vname;
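The result of the select? Note how func1 has ver02 data even though it points at the diff from ver01. The vname = (select max(vname) ...) condition above is what plugs the hole.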
funcname|vname|arb_data
"func1";"ver01";"data func1 ver01"
"func1";"ver02";"data func1 ver01"
"func2";"ver01";"data func2 ver01"
"func2";"ver02";"data func2 ver02"
One thing to note that still needs adding: a closure mechanism to indicate that a function no longer exists as of some version.
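One possible shape for such a closure mechanism, sketched directly in SQLite rather than PostgreSQL, is a tombstone row. The `removed` column is hypothetical and not part of the answer's schema. A row with removed = 1 marks the version in which a function disappears, and the view hides the function from that version onward. The view's columns are named with AS aliases rather than a column list, to stay compatible with older SQLite releases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE versions (vname TEXT);
    -- `removed` is a hypothetical tombstone flag, not in the original answer
    CREATE TABLE func_data (
        funcname TEXT,
        vname    TEXT,
        arb_data TEXT,
        removed  INTEGER NOT NULL DEFAULT 0
    );

    -- Same view as above, plus a filter on the tombstone flag. The inner
    -- MAX() still sees tombstone rows, so a removal overrides older data.
    CREATE VIEW func_data_calc AS
    SELECT fd.funcname AS funcname,
           v.vname     AS vname,
           fd.arb_data AS arb_data
      FROM func_data AS fd, versions AS v
     WHERE fd.vname = (SELECT MAX(s.vname)
                         FROM func_data AS s
                        WHERE s.funcname = fd.funcname
                          AND s.vname <= v.vname)
       AND fd.removed = 0;

    INSERT INTO versions (vname) VALUES ('ver01'), ('ver02'), ('ver03');
    INSERT INTO func_data (funcname, vname, arb_data, removed) VALUES
        ('func1', 'ver01', 'data func1 ver01', 0),
        ('func1', 'ver03', NULL, 1),   -- func1 no longer exists as of ver03
        ('func2', 'ver01', 'data func2 ver01', 0),
        ('func2', 'ver02', 'data func2 ver02', 0);
""")

rows = conn.execute(
    "SELECT funcname, vname, arb_data FROM func_data_calc "
    "ORDER BY funcname, vname"
).fetchall()
for row in rows:
    print(row)
```

With this data, func1 still inherits its ver01 data in ver02 but is absent from ver03, while func2 carries its ver02 data forward into ver03.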