I am building a report that will use imported data to provide a list of missing sequences:
CREATE TABLE `client_trans`
(
`id` INT NOT NULL AUTO_INCREMENT,
`client_id` INT NULL,
`sequence` INT NULL,
`other_data` INT NULL,
PRIMARY KEY (`id`),
INDEX `client_id_seq` (`client_id` ASC, `sequence` ASC)
);
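Apart from the id field, there is no truly unique value, not even a unique combination of values.
The data in this table looks like this (ignoring the other_data field):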
id client_id sequence
1 1000 1
2 1000 2
3 1000 2
4 1000 3
5 1001 1
6 1001 5
7 1001 6
8 1002 4
9 1002 6
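As the example above shows, a client_id / sequence combination can occur more than once, and a sequence may not start at 1 (or at 0).
While it is possible to run a query to find the missing sequences, e.g. a variation on the answer to this question, this can take a very long time.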
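For reference, a minimal sketch of the kind of gap query meant here, assuming the client_trans table and sample data above. It is a plain self-join variant written for illustration and not necessarily what the linked answer uses:

```sql
/* For every (client_id, sequence) that has a later sequence for the same
   client, find the next sequence and report the gap in between, if any. */
SELECT t1.client_id,
       t1.sequence + 1      AS miss_start,
       MIN(t2.sequence) - 1 AS miss_end
FROM client_trans t1
JOIN client_trans t2
  ON t2.client_id = t1.client_id
 AND t2.sequence  > t1.sequence
GROUP BY t1.client_id, t1.sequence      /* duplicates collapse into one group */
HAVING MIN(t2.sequence) > t1.sequence + 1
ORDER BY t1.client_id, t1.sequence;
/* With the sample data this returns:
   1001 2 4
   1002 5 5 */
```

The self-join compares every row of a client with every later row of the same client, which is where the long run time comes from on a large table.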
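An alternative to this approach is to run some insert/update queries before or while the data is inserted into the table (using the Pentaho Data Integration tool), maintaining an additional table that holds the missing client_id / sequence values. In the example above, this means that when the (client_id, sequence) values (1001, 5) are inserted, something like the query I have below would pick up that sequences 2-4 are missing: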
CREATE TABLE `missing_sequences` (
`client_id` int(11),
`miss_start` int(11),
`miss_end` int(11)
);
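(Note: to make the query easier to test in SQL rather than in the Pentaho Execute SQL statement step, the insert is commented out so that it is just a select.)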
SET @temp_id = 1001;
SET @temp_seq = 5;
/* Replace temp_id, temp_seq references with ? in Pentaho */
/* INSERT INTO missing_sequences (client_id, miss_start, miss_end) */
SELECT @temp_id client_id, max(t1.sequence) + 1 missing_start, @temp_seq - 1 missing_end
FROM client_trans t1
CROSS JOIN client_trans t2
WHERE t1.client_id = @temp_id
AND t1.sequence < @temp_seq
AND t2.client_id = @temp_id
AND t2.sequence >= @temp_seq - 1
HAVING missing_end >= missing_start;
Result:
client_id missing_start missing_end
1001 2 4
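This successfully populates the missing sequences table, but a problem arises when a row is added that contains one of the previously missing sequences.
(Originally I also had a primary key on client_id and miss_start, which would also take care of duplicate values being added, but I am not entirely sure that is correct.)
Depending on the sequence number inserted, one of four possibilities applies, for example: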
@temp_seq = missing_start : (@temp_seq = 2)
update missing_start += 1
missing_start < @temp_seq < missing_end : (@temp_seq = 3)
split into two records
@temp_seq = missing_end : (@temp_seq = 4)
update missing_end -= 1
@temp_seq = missing_start = missing_end : (@temp_id = 1002, @temp_seq = 5)
delete record from missing_sequences table
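The four cases above can be collapsed into three set-based statements. This is only a sketch under assumptions: the ranges stored for one client never overlap, and @temp_client_id / @temp_sequence are placeholder variables standing in for the incoming row (they would become ? parameters in Pentaho).

```sql
SET @temp_client_id = 1001;  /* placeholder: client of the incoming row   */
SET @temp_sequence  = 3;     /* placeholder: sequence of the incoming row */

/* 1. If the range extends past the new sequence, store the upper remainder
      (covers "equal to missing_start" and "strictly inside"). */
INSERT INTO missing_sequences (client_id, miss_start, miss_end)
SELECT client_id, @temp_sequence + 1, miss_end
FROM missing_sequences
WHERE client_id  = @temp_client_id
  AND miss_start <= @temp_sequence
  AND miss_end   >  @temp_sequence;

/* 2. Cut the original range so that it ends just before the new sequence.
      The row added in step 1 starts after @temp_sequence, so it is not hit. */
UPDATE missing_sequences
SET miss_end = @temp_sequence - 1
WHERE client_id  = @temp_client_id
  AND miss_start <= @temp_sequence
  AND miss_end   >= @temp_sequence;

/* 3. Remove ranges that step 2 left empty
      (the sequence was equal to missing_start). */
DELETE FROM missing_sequences
WHERE client_id = @temp_client_id
  AND miss_end  < miss_start;
```

The order matters: step 1 has to run before step 2, because step 2 overwrites the miss_end value that step 1 copies. A sequence that falls in no stored range, such as a duplicate, matches none of the WHERE clauses and changes nothing. The sketch does not create the initial gap record; that still needs a query like the one above.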
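Now comes my question (which arises even sooner if you consider that the imported data may not be sorted): how do I cater for each of these possibilities, as well as the initial insert and duplicates, in a Pentaho Data Integration transformation?
Edit: After some brainstorming I came up with the following script, which seems to work when I run it in MySQL but not when it runs as an "Execute SQL statement" step. This is with a primary key on (client_id, miss_start) in the missing_sequences table:
SET @orig_start = 0;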
SET @orig_end = 0;
SET @temp_client_id = ?;
SET @temp_sequence = ?;
/* Find closest matching record and save start/end values*/
SELECT client_id, @orig_start:=miss_start miss_start, @orig_end:=miss_end miss_end
FROM missing_sequences
WHERE client_id = @temp_client_id
AND miss_start <= @temp_sequence
AND miss_end >= @temp_sequence
LIMIT 1; /* Just in case, delete all matches later anyway */
/* Delete the above record if exists */
DELETE FROM missing_sequences
WHERE client_id = @temp_client_id AND miss_start = @orig_start AND miss_end = @orig_end;
/* Insert new value. This will insert the FIRST value in the table
eg. if 1-10 is missing and 5 inserted, this will insert 1-4 as missing */
INSERT INTO missing_sequences (client_id, miss_start, miss_end)
SELECT @temp_client_id client_id, @curr_start := max(t1.sequence) + 1 miss_start, @curr_end := @temp_sequence - 1 miss_end
FROM client_trans t1
CROSS JOIN client_trans t2
WHERE t1.client_id = @temp_client_id
AND t1.sequence < @temp_sequence
AND t2.client_id = @temp_client_id
AND t2.sequence >= @temp_sequence - 1
HAVING miss_end >= miss_start
ON DUPLICATE KEY UPDATE client_id = @temp_client_id,miss_start = @curr_start;
/* Insert upper missing value if it is different */
INSERT INTO missing_sequences (client_id, miss_start, miss_end)
SELECT @temp_client_id client_id, @curr_end + 2 missing_start, @orig_end missing_end
FROM dual
WHERE @curr_end + 2 <= @orig_end
ON DUPLICATE KEY UPDATE client_id = @temp_client_id,miss_start = @curr_start;
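The "Execute for each row" and "Variable substitution" boxes are checked, but execution seems inconsistent, or the missing_sequences table is not updated at all.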