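Redshift Serializable Isolation violation error message: wrong transaction ID?

Time: 2021-07-27 06:12:49

Tags: database concurrency amazon-redshift

I am trying to identify the transactions that violate serializable isolation on Redshift, for example: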

ERROR:  1023
DETAIL:  Serializable isolation violation on table - 4117431, transactions forming the cycle are: 246544535, 246540473 (pid:1777)

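To understand this better, I used the toy example from the AWS documentation here: https://docs.aws.amazon.com/redshift/latest/dg/c_serial_isolation.html#c_serial_isolation-serializable-isolation-troubleshooting

The error message appears to contain a transaction ID that is not one of the concurrent transactions I am currently running. Am I misunderstanding something?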
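To compare the IDs in the message against anything else, it helps to pull them out of the DETAIL line first. The following is a minimal Python sketch, assuming the DETAIL text always has the shape shown above; the regular expression and the helper name `parse_violation` are my own and are not part of any Redshift tooling.

```python
import re

# Pattern inferred from the single error format shown in this question (an assumption).
DETAIL_RE = re.compile(
    r"Serializable isolation violation on table - (?P<table_id>\d+), "
    r"transactions forming the cycle are: (?P<xids>\d+(?:, \d+)*) "
    r"\(pid:(?P<pid>\d+)\)"
)


def parse_violation(detail):
    """Return the table id, the reported xids and the pid, or None if the text does not match."""
    match = DETAIL_RE.search(detail)
    if match is None:
        return None
    return {
        "table_id": int(match.group("table_id")),
        "xids": [int(x) for x in match.group("xids").split(", ")],
        "pid": int(match.group("pid")),
    }


if __name__ == "__main__":
    detail = (
        "DETAIL:  Serializable isolation violation on table - 4117431, "
        "transactions forming the cycle are: 246544535, 246540473 (pid:1777)"
    )
    print(parse_violation(detail))
```

The pattern accepts one or more transaction IDs, so it also copes with a message that lists only a single ID.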

I ran two experiments to confirm this:

Experiment 1

Transaction 1 (T1) - user: user_a

mydb=> begin;
BEGIN
mydb=*> select * from test.sl;
 id
----
  1
  3
  7
  2
(4 rows)

mydb=*> insert into test.sl2 values (7);
INSERT 0 1
mydb=*> end;
COMMIT

Transaction 2 (T2) - user: user_b

mydb=# begin;
BEGIN
mydb=*# select * from test.sl2;
 id
----
 11
  3
  9
  8
(4 rows)

mydb=*# insert into test.sl values (6);
ERROR:  1023
DETAIL:  Serializable isolation violation on table - 4117431, transactions forming the cycle are: 246544535, 246540473 (pid:1777)
mydb=!# end;

Debugging

mydb=# select xid, 
              pid,
              starttime,
              endtime,
              sequence, 
              case 
                when xid in (select xact_id from stl_tr_conflict) then 1 
                else 0 
              end as aborted, 
              trim(text) as text 
              from svl_statementtext 
              where xid in (246544535, 246540473) order by xid, sequence, starttime;

    xid    |  pid  |         starttime          |          endtime           | sequence | aborted |               text
-----------+-------+----------------------------+----------------------------+----------+---------+-----------------------------------
 246540473 | 31342 | 2021-07-26 10:02:35.975449 | 2021-07-26 10:02:35.975451 |        0 |       0 | begin;
 246540473 | 31342 | 2021-07-26 10:02:40.219189 | 2021-07-26 10:02:40.713895 |        0 |       0 | select * from test.sl;
 246540473 | 31342 | 2021-07-26 10:03:02.616113 | 2021-07-26 10:03:02.628287 |        0 |       0 | insert into test.sl2 values (11);
 246540473 | 31342 | 2021-07-26 10:03:32.585407 | 2021-07-26 10:03:33.036425 |        0 |       0 | COMMIT
 246544535 |  1777 | 2021-07-26 10:14:40.687421 | 2021-07-26 10:14:40.687423 |        0 |       1 | begin;
 246544535 |  1777 | 2021-07-26 10:15:46.711658 | 2021-07-26 10:15:46.71843  |        0 |       1 | select * from test.sl2;
 246544535 |  1777 | 2021-07-26 10:16:03.639541 | 2021-07-26 10:16:03.6423   |        0 |       1 | insert into test.sl values (6);
(7 rows)

I can see that xid = 246540473 is not one of the concurrent transactions (T1 or T2).

So I tested it again.

Experiment 2

T1 - user: user_a

mydb=> begin;
BEGIN
mydb=*> select * from test.sl;
 id
----
  2
  1
  3
  7
(4 rows)

mydb=*> insert into test.sl2 values (12);
INSERT 0 1
mydb=*>

T2 - user: user_b

mydb=# begin;
BEGIN
mydb=*# select * from test.sl2;
 id
----
  8
  3
  9
 11
  7
(5 rows)

mydb=*# insert into test.sl values (13);
ERROR:  1023
DETAIL:  Serializable isolation violation on table - 4117431, transactions forming the cycle are: 246549376, 246544529 (pid:6733)
mydb=!#

This time, though, before ending the two transactions I recorded their transaction IDs by querying svv_transactions and filtering on txn_owner.

mydb=# select * from svv_transactions where txn_owner in ('user_b', 'user_a') limit 10;
 txn_owner | txn_db  |    xid    | pid  |         txn_start          |    lock_mode    | lockable_object_type | relation | granted
-----------+---------+-----------+------+----------------------------+-----------------+----------------------+----------+---------
 user_a    | mydb    | 246549373 | 6727 | 2021-07-26 10:46:20.116482 | AccessShareLock | relation             |   252024 | t
 user_a    | mydb    | 246549373 | 6727 | 2021-07-26 10:46:20.116482 | AccessShareLock | relation             |  4117431 | t
 user_a    | mydb    | 246549373 | 6727 | 2021-07-26 10:46:20.116482 | ExclusiveLock   | transactionid        |          | t
 user_b    | mydb    | 246549376 | 6733 | 2021-07-26 10:46:23.702597 | AccessShareLock | relation             |   252024 | t
 user_b    | mydb    | 246549376 | 6733 | 2021-07-26 10:46:23.702597 | AccessShareLock | relation             |  4117498 | t
 user_b    | mydb    | 246549376 | 6733 | 2021-07-26 10:46:23.702597 | ExclusiveLock   | transactionid        |          | t

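I can see that the transaction IDs in experiment 2 are 246549373 and 246549376. The error message gives me 246549376, which makes sense. But the second ID, 246544529, does not: that one is from experiment 1.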

mydb=# select xid, 
              pid, 
              starttime, 
              endtime, 
              sequence, 
              case 
                when xid in (select xact_id from stl_tr_conflict) then 1 
                else 0 
              end as aborted, 
              trim(text) as text 
              from svl_statementtext 
              where xid in (246549376, 246544529, 246549373) 
              order by xid, starttime, sequence;
    xid    | pid  |         starttime          |          endtime           | sequence | aborted |               text
-----------+------+----------------------------+----------------------------+----------+---------+-----------------------------------
 246544529 | 1779 | 2021-07-26 10:14:37.052255 | 2021-07-26 10:14:37.052257 |        0 |       0 | begin;
 246544529 | 1779 | 2021-07-26 10:15:43.173474 | 2021-07-26 10:15:43.185421 |        0 |       0 | select * from test.sl;
 246544529 | 1779 | 2021-07-26 10:15:56.973818 | 2021-07-26 10:15:56.986552 |        0 |       0 | insert into test.sl2 values (7);
 246544529 | 1779 | 2021-07-26 10:16:42.137115 | 2021-07-26 10:16:42.674209 |        0 |       0 | COMMIT
 246549373 | 6727 | 2021-07-26 10:44:37.179593 | 2021-07-26 10:44:37.179594 |        0 |       0 | begin;
 246549373 | 6727 | 2021-07-26 10:46:20.119846 | 2021-07-26 10:46:20.352005 |        0 |       0 | select * from test.sl;
 246549373 | 6727 | 2021-07-26 10:47:00.662191 | 2021-07-26 10:47:00.674989 |        0 |       0 | insert into test.sl2 values (12);
 246549376 | 6733 | 2021-07-26 10:44:38.798094 | 2021-07-26 10:44:38.798095 |        0 |       1 | begin;
 246549376 | 6733 | 2021-07-26 10:46:23.705674 | 2021-07-26 10:46:23.715201 |        0 |       1 | select * from test.sl2;
 246549376 | 6733 | 2021-07-26 10:47:07.167762 | 2021-07-26 10:47:07.17054  |        0 |       1 | insert into test.sl values (13);
(10 rows)
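The mismatch can also be checked mechanically. This is a small Python sketch, assuming the reported IDs have been copied from the error message and the active IDs from the svv_transactions output above; the function name `compare_xids` is hypothetical.

```python
def compare_xids(reported, active):
    """Split the xids from the error message against the xids that were actually open.

    reported: xids listed in the error message
    active:   xids seen in svv_transactions while both sessions were open
    """
    active = set(active)
    reported_set = set(reported)
    return {
        # Named in the error, but not one of the open transactions.
        "unexpected": sorted(x for x in reported_set if x not in active),
        # Open at the time, but not named in the error.
        "missing": sorted(x for x in active if x not in reported_set),
    }


if __name__ == "__main__":
    # Values copied from experiment 2 in this question.
    reported = [246549376, 246544529]
    active = [246549373, 246549376]
    print(compare_xids(reported, active))
```

For experiment 2 this puts 246544529 in the unexpected list and 246549373 in the missing list, which is exactly the discrepancy I am asking about.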

Why doesn't it give me 246549373? What am I misunderstanding?

1 Answer:

Answer 0: (score: 0)

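I ran your code on a fresh cluster and, using your debugging SQL, got the transactions you would expect in the error message. However, on rerunning it, I get the earlier PID reported as the conflict, even though that transaction has been rolled back and no locks are in place.

Disconnecting the sessions has no effect on this. Rebooting the cluster does not clear it up either, and makes things odder: the error message then reports only one transaction ID!

It looks like you have found a bug in Redshift's error message system, and you will want to file a ticket. The database is operating correctly, detecting and aborting the condition that creates a circular dependency, but the error message generator gets confused when the same tables are run through this again. These errors are fairly rare, and I expect there are no release test cases covering multiple serialization errors occurring in a row.

Good catch.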
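To illustrate why aborting T2 is the correct behavior regardless of what the message prints, here is a toy Python model of the circular condition. It is only a sketch under my own assumptions: it treats "A read a table that B wrote" as a dependency edge and looks for a cycle with a depth-first search. It is not Redshift's actual implementation, and the names `rw_edges` and `has_cycle` are hypothetical.

```python
def rw_edges(txns):
    """Build edges A -> B whenever A reads a table that B writes (toy model, an assumption)."""
    edges = {name: set() for name in txns}
    for a, ta in txns.items():
        for b, tb in txns.items():
            if a != b and ta["reads"] & tb["writes"]:
                edges[a].add(b)
    return edges


def has_cycle(edges):
    """Depth-first search that reports whether the dependency graph contains a cycle."""
    state = {}  # node -> "visiting" or "done"

    def visit(node):
        if state.get(node) == "visiting":
            return True  # reached a node that is still on the current path
        if state.get(node) == "done":
            return False
        state[node] = "visiting"
        if any(visit(nxt) for nxt in edges[node]):
            return True
        state[node] = "done"
        return False

    return any(visit(node) for node in edges)


if __name__ == "__main__":
    # The two sessions from the question: each reads one table and writes the other.
    txns = {
        "T1": {"reads": {"test.sl"}, "writes": {"test.sl2"}},
        "T2": {"reads": {"test.sl2"}, "writes": {"test.sl"}},
    }
    print(has_cycle(rw_edges(txns)))
```

In this model T1 and T2 each depend on the other, so no serial order of the two is consistent with what both of them read, and one of them has to be aborted.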