Question

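I ran into a schema and an upsert stored procedure that cause deadlocks. I have a rough idea of why this deadlocks and how to fix it. I can reproduce it, but I don't have a clear understanding of the sequence of steps that leads to it. It would be great if someone could clearly explain why this deadlocks.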

Here are the schema and the stored procedures. This code is being executed on PostgreSQL 9.2.2.

CREATE TABLE counters (
  count_type INTEGER NOT NULL,
  count_id   INTEGER NOT NULL,
  count      INTEGER NOT NULL
);


CREATE TABLE primary_relation (
  id INTEGER PRIMARY KEY,
  a_counter INTEGER NOT NULL DEFAULT 0
);

INSERT INTO primary_relation
SELECT i FROM generate_series(1,5) AS i;

CREATE OR REPLACE FUNCTION increment_count(ctype integer, cid integer, i integer) RETURNS VOID
AS $$
BEGIN
    LOOP
        UPDATE counters
         SET count = count + i 
         WHERE count_type = ctype AND count_id = cid;
         IF FOUND THEN
            RETURN;
          END IF; 
        BEGIN
            INSERT INTO counters (count_type, count_id, count)
             VALUES (ctype, cid, i); 
            RETURN;
        EXCEPTION WHEN OTHERS THEN
        END;
    END LOOP;
END;
$$
LANGUAGE PLPGSQL;

CREATE OR REPLACE FUNCTION update_primary_a_count(ctype integer) RETURNS VOID
AS $$
  WITH deleted_counts_cte AS (
      DELETE
      FROM counters
      WHERE count_type = ctype
      RETURNING *
  ), rollup_cte AS (
      SELECT count_id, SUM(count) AS count
      FROM deleted_counts_cte
      GROUP BY count_id
      HAVING SUM(count) <> 0
  )
  UPDATE primary_relation
  SET a_counter = a_counter + rollup_cte.count
  FROM rollup_cte
  WHERE primary_relation.id = rollup_cte.count_id
$$ LANGUAGE SQL;

Here is a Python script that reproduces the deadlock.

import os
import random
import time
import psycopg2

COUNTERS = 5 
THREADS = 10
ITERATIONS = 500 

def increment():
  outf = open('synctest.out.%d' % os.getpid(), 'w')
  conn = psycopg2.connect(database="test")
  cur = conn.cursor()
  for i in range(0,ITERATIONS):
    time.sleep(random.random())
    start = time.time()
    cur.execute("SELECT increment_count(0, %s, 1)", [random.randint(1,COUNTERS)])
    conn.commit()
    outf.write("%f\n" % (time.time() - start))
  conn.close()
  outf.close()

def update(n):
  outf = open('synctest.update', 'w')
  conn = psycopg2.connect(database="test")
  cur = conn.cursor()
  for i in range(0,n):
    time.sleep(random.random())
    start = time.time()
    cur.execute("SELECT update_primary_a_count(0)")
    conn.commit()
    outf.write("%f\n" % (time.time() - start))
  conn.close()
  outf.close()

pids = []
for i in range(THREADS):
  pid = os.fork()
  if pid != 0:
    print('Process %d spawned' % pid)
    pids.append(pid)
  else:
    print('Starting child %d' % os.getpid())
    increment()
    print('Exiting child %d' % os.getpid())
    os._exit(0)

update(ITERATIONS)
for pid in pids:
  print("waiting on %d" % pid)
  os.waitpid(pid, 0)

# cleanup
update(1)

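I recognize that one problem with this is that the upsert can produce duplicate rows (with multiple writers), which may result in some double counting. But why does this result in a deadlock?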
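
A minimal sketch of how such duplicates can arise, using a plain Python list as a stand-in for the counters table. No database is involved, and the helper names try_update and insert are hypothetical. It replays one possible interleaving by hand, in which both writers run the UPDATE step before either runs the INSERT step:

```python
# In-memory stand-in for the counters table: a list of
# [count_type, count_id, count] rows with no unique constraint.
counters = []

def try_update(ctype, cid, i):
    """Mimic the UPDATE step: bump every matching row, report if any matched."""
    found = False
    for row in counters:
        if row[0] == ctype and row[1] == cid:
            row[2] += i
            found = True
    return found

def insert(ctype, cid, i):
    """Mimic the INSERT step: nothing stops a second row with the same key."""
    counters.append([ctype, cid, i])

# Hand-written interleaving of two writers, A and B, for key (0, 1).
a_found = try_update(0, 1, 1)   # A: no row yet, so nothing is found
b_found = try_update(0, 1, 1)   # B: still no row, nothing is found either
if not a_found:
    insert(0, 1, 1)             # A inserts a row for (0, 1)
if not b_found:
    insert(0, 1, 1)             # B inserts a second row for (0, 1)

duplicate_rows = [r for r in counters if r[:2] == [0, 1]]
print(len(duplicate_rows))      # number of rows now sharing the key (0, 1)

# A later increment for that key now has to touch both rows.
try_update(0, 1, 1)
print(counters)
```

Once two rows share a key, every later UPDATE for that key modifies more than one row, so a single statement ends up taking several row locks instead of one.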

The error I get from PostgreSQL looks like this:

process 91924 detected deadlock while waiting for ShareLock on transaction 4683083 after 100.559 ms",,,,,"SQL statement ""UPDATE counters

The client spits out something like this:

psycopg2.extensions.TransactionRollbackError: deadlock detected
DETAIL:  Process 91924 waits for ShareLock on transaction 4683083; blocked by process 91933.
Process 91933 waits for ShareLock on transaction 4683079; blocked by process 91924.
HINT:  See server log for query details.CONTEXT:  SQL statement "UPDATE counters
         SET count = count + i
         WHERE count_type = ctype AND count_id = cid"
PL/pgSQL function increment_count(integer,integer,integer) line 4 at SQL statement
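
The two DETAIL lines describe a cycle: each process is blocked by the other. As an illustration only (this is not PostgreSQL's deadlock detector, and find_cycle is a hypothetical helper), the wait-for relationship can be written down as a small graph and the cycle found by following the edges. The sketch assumes each process waits on at most one other process:

```python
# Wait-for edges copied from the DETAIL lines: waiter -> blocker.
waits_for = {
    91924: 91933,
    91933: 91924,
}

def find_cycle(graph, start):
    """Follow waiter -> blocker edges from start; return the cycle if one closes."""
    path = []
    node = start
    while node in graph and node not in path:
        path.append(node)
        node = graph[node]
    if node in path:
        # The walk came back to a process it already visited.
        return path[path.index(node):]
    return None

cycle = find_cycle(waits_for, 91924)
print(cycle)
```

Neither process in the cycle can make progress on its own, which is why the server aborts one of the transactions with the error shown.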

To fix this, you just need to add a primary key like so:

ALTER TABLE counters ADD PRIMARY KEY (count_type, count_id);

Any insight is much appreciated. Thanks!

Answer 1

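Because of the primary key, the number of rows in this table is always <= # of threads, and the primary key ensures that no row is duplicated.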

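When you remove the primary key, some threads lag behind and the number of rows grows, with rows being duplicated along the way. Once rows are duplicated, each update takes longer, and two or more threads will try to update the same rows.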

Open a new terminal and type:

watch --interval 1 "psql -tc \"select count(*) from counters\" test"

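Try this with and without the primary key. When you hit the first deadlock, look at the result of the query above. In my case, this is what was left in the counters table: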

test=# select * from counters order by 2;
 count_type | count_id | count 
------------+----------+-------
          0 |        1 |   735
          0 |        1 |   733
          0 |        1 |   735
          0 |        1 |   735
          0 |        2 |   916
          0 |        2 |   914
          0 |        2 |   914
          0 |        3 |   882
          0 |        4 |   999
          0 |        5 |   691
          0 |        5 |   692
(11 rows)
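
To make the duplication in that output explicit, here is a small Python sketch that counts how many rows share each (count_type, count_id) pair. The rows are copied from the psql output above; the variable names are only for illustration:

```python
from collections import Counter

# (count_type, count_id, count) rows copied from the psql output above.
rows = [
    (0, 1, 735), (0, 1, 733), (0, 1, 735), (0, 1, 735),
    (0, 2, 916), (0, 2, 914), (0, 2, 914),
    (0, 3, 882),
    (0, 4, 999),
    (0, 5, 691), (0, 5, 692),
]

# Count rows per key and keep only the keys that appear more than once.
rows_per_key = Counter((ctype, cid) for ctype, cid, _ in rows)
duplicated = {key: n for key, n in rows_per_key.items() if n > 1}
print(duplicated)
```

Three of the five counters have more than one row, so an UPDATE for any of those keys has to modify several rows in one statement.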

Answer 2

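Your code is a perfect recipe for a race condition (multiple threads, random sleeps). The problem is most likely a locking issue. Since you don't mention the locking mode, I'm going to assume it is page-based locking, in which case you get the following scenario: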

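Thread 1 starts and begins inserting records; let's say it locks page 1 and then needs to lock page 2.
Thread 2 starts at the same time as thread 1, but it locks page 2 first and then needs to lock page 1.
Both threads are now waiting for each other to finish, so you have a deadlock.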
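
The scenario above is the classic opposite-order locking pattern. A minimal Python 3 sketch of it, using threading.Lock objects as hypothetical stand-ins for the two pages (the names page_1, page_2 and worker are made up for illustration, and nothing here talks to a database). Each thread takes its first lock, waits until the other has done the same, then tries the second lock with a timeout so the demo terminates instead of hanging:

```python
import threading

page_1 = threading.Lock()   # hypothetical stand-ins for the two locked pages
page_2 = threading.Lock()
both_hold_first = threading.Barrier(2)
both_tried_second = threading.Barrier(2)
got_second = {}

def worker(name, first, second):
    with first:
        both_hold_first.wait()             # each thread now holds its first lock
        ok = second.acquire(timeout=0.2)   # give up instead of waiting forever
        got_second[name] = ok
        if ok:
            second.release()
        both_tried_second.wait()           # keep the first lock until both have tried

t1 = threading.Thread(target=worker, args=("thread 1", page_1, page_2))
t2 = threading.Thread(target=worker, args=("thread 2", page_2, page_1))
t1.start()
t2.start()
t1.join()
t2.join()
print(got_second)
```

Neither thread obtains its second lock. Without the timeout both would wait forever; a database resolves the same situation by aborting one of the transactions instead.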

Now, why does a PK fix it?

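Because locking is done through the index first, the race condition is mitigated: the PK is unique on insert, so all threads wait on the index, and in the update the access goes through the index, so each record is locked based on its PK.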

Answer 3

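At some point, one user is waiting for a lock held by another user, while the first user holds a lock that the second user wants. That is what causes a deadlock.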

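At a guess, this is because when you update counters in the increment stored procedure, the entire table has to be read if there is no primary key (or indeed any key). The same goes for the primary_relation table. This leaves locks strewn about and opens the way to a deadlock. I'm not a Postgres user, so I don't know the details of exactly when it places locks, but I'm pretty sure this is what is happening.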

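Putting a PK on counters lets the database target exactly the rows it needs to read and take the minimum number of locks. You should also have a PK on primary_relation!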
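
A toy Python sketch of the difference this answer describes, counting how many rows a lookup has to examine with and without a key. It is only an in-memory model (a list for the keyless table, a dict for the keyed one), it says nothing about how PostgreSQL actually plans queries or places locks, and scan_lookup and keyed_lookup are hypothetical names:

```python
# Hypothetical in-memory table: 5 counter keys, as in the reproduction script.
rows = [(0, cid, 0) for cid in range(1, 6)]

def scan_lookup(table, ctype, cid):
    """No key: every row is examined to find the matches."""
    examined = 0
    matches = []
    for row in table:
        examined += 1
        if row[0] == ctype and row[1] == cid:
            matches.append(row)
    return matches, examined

# With a unique key, a dict keyed on (count_type, count_id) models the index.
index = {(ctype, cid): (ctype, cid, count) for ctype, cid, count in rows}

def keyed_lookup(idx, ctype, cid):
    """Unique key: at most one row, reached directly."""
    row = idx.get((ctype, cid))
    if row is None:
        return [], 0
    return [row], 1

scan_matches, scan_examined = scan_lookup(rows, 0, 3)
key_matches, key_examined = keyed_lookup(index, 0, 3)
print(scan_examined, key_examined)
```

Both lookups return the same row, but the keyed one reaches it without visiting the others, and uniqueness guarantees there is only one row to update per key.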

Why does the lack of a primary/unique key cause deadlock problems with upsert?
