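PL/pgSQL function to randomly select an id

Date: 2015-09-22 22:13:52

Tags: postgresql plpgsql

Objective:

  1. Pre-populate a table with a list of sequential ids, e.g. 1 to 1,000,000. The table has an additional nullable column: a NULL value marks the id as unassigned, a non-NULL value marks it as assigned.
  2. Have a function I can call that requests x randomly selected unassigned ids from the table.
  3. This is for something very specific, and while I know there are different ways to do this, I would like to know whether there is a fix for this particular implementation.

    I have something that partially works, but would like to know where the flaw in the function is.

    Here is the table: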

    CREATE SEQUENCE accounts_seq MINVALUE 700000000001 NO MAXVALUE;
    
    CREATE TABLE accounts (
      id BIGINT PRIMARY KEY default nextval('accounts_seq'), 
      client VARCHAR(25), UNIQUE(id, client)
    );
    

    The function gen_account_ids is a one-time setup step that pre-populates the table with a fixed number of rows, all marked as unassigned:

    /*
      This function will insert new rows into the accounts table with ids being
      generated by a sequence, and client being NULL.  A NULL client indicates
      the account has not yet been assigned.
    */
    CREATE OR REPLACE FUNCTION gen_account_ids(bigint)
      RETURNS INT AS $gen_account_ids$
    DECLARE
      -- count is the number of new accounts you want generated
      count alias for $1;
      -- rowcount is returned as the number of rows inserted
      rowcount int;
    BEGIN
      INSERT INTO accounts(client) SELECT NULL FROM generate_series(1, count);
      GET DIAGNOSTICS rowcount = ROW_COUNT;
      RETURN rowcount;
    END;
    $gen_account_ids$ LANGUAGE plpgsql;
    

    So, I use it to pre-populate the table with, say, 1000 records:

    SELECT gen_account_ids(1000);
    

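    The next function, assign, is meant to randomly select unassigned ids (unassigned means the client column is NULL) and update them with a client value so that they become assigned. It returns the number of rows affected.

    It works sometimes, but I believe collisions are happening, which is why I tried DISTINCT, yet it often returns fewer rows than requested. For example, if I run select assign(100, 'foo'); it might return 95 rows instead of the desired 100.

    How can I modify it so that it always returns exactly the requested number of rows?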

       /*
         This will assign ids to a client randomly
         @param int is the number of account numbers to generate
         @param varchar(10) is a string descriptor for the client
         @returns the number of rows affected -- should be the same as the input int
    
         Call it like this: `SELECT * FROM assign(100, 'FOO')`
       */
       CREATE OR REPLACE FUNCTION assign(INT, VARCHAR(10))
         RETURNS INT AS $$
       DECLARE
         total ALIAS FOR $1;
         clientname ALIAS FOR $2;
         rowcount int;
       BEGIN
         UPDATE accounts SET client = clientname WHERE id IN (
           SELECT DISTINCT trunc(random() * (
             (SELECT max(id) FROM accounts WHERE client IS NULL) - 
             (SELECT min(id) FROM accounts WHERE client IS NULL)) + 
             (SELECT min(id) FROM accounts WHERE client IS NULL)) FROM generate_series(1, total));
         GET DIAGNOSTICS rowcount = ROW_COUNT;
         RETURN rowcount;
       END;
       $$ LANGUAGE plpgsql;
    

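    This is loosely based on an approach where you can run something like SELECT trunc(random() * (100 - 1) + 1) FROM generate_series(1,5);, which selects 5 random numbers between 1 and 100.

    My goal is to do something similar: pick a random id between the minimum and maximum unassigned ids and mark it for update.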

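2 Answers:

Answer 0 (score: 2)

This isn't the best answer, b/c it does involve a full table scan, but in my situation I'm not concerned with performance, and it works. It is based on @CraigRinger's reference to the blog post "getting random tuples".

I'm generally interested in hearing other (perhaps better) solutions, and am specifically curious why the original solution falls short and what else @klin has devised.

So, here is my brute-force random-order solution: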

-- generate a million unassigned rows with null client column
insert into accounts(client) select null from generate_series(1, 1000000);

-- assign 1000 random rows to client 'foo'
update accounts set client = 'foo' where id in 
  (select id from accounts where client is null order by random() limit 1000);
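
If a function with the same shape as the question's assign is still wanted, the query above could be wrapped roughly as follows. This is a minimal sketch, assuming the accounts table from the question; the name assign_random and the parameter names total and clientname are hypothetical and chosen only for illustration.

```sql
-- Sketch only: wraps the ORDER BY random() ... LIMIT approach in a function.
-- assign_random, total and clientname are illustrative names.
CREATE OR REPLACE FUNCTION assign_random(total INT, clientname VARCHAR(25))
  RETURNS INT AS $$
DECLARE
  -- rowcount is returned as the number of rows updated
  rowcount INT;
BEGIN
  UPDATE accounts SET client = clientname WHERE id IN (
    SELECT id
    FROM accounts
    WHERE client IS NULL   -- only unassigned rows are candidates
    ORDER BY random()      -- shuffle the candidates (full scan and sort)
    LIMIT total);          -- keep the first "total" of them
  GET DIAGNOSTICS rowcount = ROW_COUNT;
  RETURN rowcount;
END;
$$ LANGUAGE plpgsql;
```

Called as SELECT assign_random(100, 'foo');, this should report 100 as long as at least 100 unassigned rows remain, because the subquery returns distinct ids of existing unassigned rows rather than independently drawn numbers.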

Answer 1 (score: 1)

Since the ids of a random subset of rows are not consecutive, select a random row_number() instead of a random id:

with nulls as ( -- base query
    select id
    from accounts 
    where client is null
    ),
randoms as ( -- pick a random int in range 1..count(nulls.*)
    -- floor(random() * N) + 1 covers 1..N; trunc(random() * (N - 1) + 1) never reaches N
    select floor(random() * count(*))::int + 1 as random_value
    from nulls
    ),
row_numbers as ( -- add row numbers to nulls
    select id, row_number() over (order by id) rn
    from nulls
    )
select id
from row_numbers, randoms
where rn = random_value; -- random row number

No function is needed here, but you can easily put the query in a function body if you want.
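
As a rough illustration of putting the query in a function body, the sketch below wraps the single-row selection in an SQL-language function. It assumes the accounts table from the question already exists when the function is created; the name random_unassigned_id is hypothetical, and the random row number is drawn with floor(random() * count(*)) + 1 so that every row number from 1 to the count can be picked.

```sql
-- Sketch only: returns the id of one random unassigned row, or NULL if none exist.
-- random_unassigned_id is an illustrative name.
CREATE OR REPLACE FUNCTION random_unassigned_id()
  RETURNS BIGINT AS $$
  WITH nulls AS ( -- base query: all unassigned rows
      SELECT id
      FROM accounts
      WHERE client IS NULL
  ),
  randoms AS ( -- one random int in range 1..count(nulls.*)
      SELECT floor(random() * count(*))::int + 1 AS random_value
      FROM nulls
  ),
  row_numbers AS ( -- number the unassigned rows 1..count
      SELECT id, row_number() OVER (ORDER BY id) AS rn
      FROM nulls
  )
  SELECT id
  FROM row_numbers, randoms
  WHERE rn = random_value;
$$ LANGUAGE sql;
```

It can then be called as SELECT random_unassigned_id();.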

The following query updates 5 random rows with a null client:

update accounts
set client = 'new value' -- <-- clientname
where id in (
    with nulls as ( -- base query
        select id
        from accounts 
        where client is null
        ),
    randoms as ( -- pick a random int in range 1..count(nulls.*) for each i
        -- floor(random() * N) + 1 covers 1..N; trunc(random() * (N - 1) + 1) never reaches N
        select i, floor(random() * count(*))::int + 1 as random_value
        from nulls
        cross join generate_series(1, 5) i -- <--  total
        group by 1
        ),
    row_numbers as ( -- add row numbers to nulls in order by id
        select id, row_number() over (order by id) rn
        from nulls
        )
    select id
    from row_numbers, randoms
    where rn = random_value -- random row number
)

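However, you cannot be sure that the query will update exactly 5 rows, because drawing n values independently, for example with

select trunc(random()* (max_value - 1) + 1)::int
from generate_series(1, n)

is not a correct way to generate n distinct random values. The probability of a repeat increases with the quotient n / max_value.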
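
To get a feel for how often repeats occur, the sketch below draws 100 independent values from the range 1..1000 and compares the number of draws with the number of distinct values. The figures 100 and 1000 mirror the question's example (1000 pre-populated rows, assign(100, 'foo')); the aliases v, draws and distinct_values are illustrative only.

```sql
-- Sketch only: count how many of n = 100 independent draws from 1..1000 are distinct.
SELECT count(*)          AS draws,
       count(DISTINCT v) AS distinct_values
FROM (
    -- random() is evaluated once per generated row
    SELECT floor(random() * 1000)::int + 1 AS v
    FROM generate_series(1, 100)
) AS s;
```

For n independent draws from m equally likely values, the expected number of distinct values is m * (1 - (1 - 1/m)^n), which is roughly 95 for m = 1000 and n = 100. That is in line with the 95 rows reported in the question.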