PostgreSQL: shuffle column values

Date: 2015-11-05 21:54:51

Tags: sql performance postgresql shuffle

In a table with more than 100k rows, how can I efficiently shuffle the values of a specific column?

Table definition:

CREATE TABLE person
(
  id integer NOT NULL,
  first_name character varying,
  last_name character varying,
 CONSTRAINT person_pkey PRIMARY KEY (id)
)

To anonymize the data, I have to shuffle the values of the 'first_name' column in place (I am not allowed to create a new table).

My attempt:

with
first_names as (
select row_number() over (order by random()),
       first_name as new_first_name
from person
),
ids as (
select row_number() over (order by random()), 
       id as ref_id
from person
)
update person
set first_name = new_first_name
from first_names, ids
where id = ref_id;

It takes hours to complete.

Is there an efficient way to do this?

2 Answers:

Answer 0 (score: 5)

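The problem with Postgres is that every UPDATE effectively means a DELETE + INSERT: the old row version is marked dead and a new one is written.

  • You can run EXPLAIN ANALYZE with a SELECT in place of the UPDATE to see how the CTE performs
  • You can drop the indexes (and recreate them afterwards) so that the update runs faster
  • But the best solution I use when I need to update all rows is to create the table again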

CREATE TABLE new_table AS 
     SELECT * ....;

DROP TABLE old_table;

ALTER TABLE new_table RENAME TO old_table;

-- then recreate the indexes and constraints

Sorry that this is not an option for you :(

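Edit: after reading a_horse_with_no_name,

it looks like you need to join the two CTEs on the row number (the original query cross-joins them):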

with
first_names as (
    select row_number() over (order by random()) rn,
           first_name as new_first_name
    from person
),
ids as (
    select row_number() over (order by random()) rn, 
           id as ref_id
    from person
)
update person
set first_name = new_first_name
from first_names
join ids
  on first_names.rn = ids.rn
where id = ref_id;

Performance questions are easier to answer if you provide the EXPLAIN ANALYZE output.
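As a hedged illustration of the first bullet in this answer, the sketch below wraps the read-only part of the shuffle in EXPLAIN (ANALYZE, BUFFERS), so the plan and timings can be inspected without modifying any row. It assumes the person table from the question; the final SELECT is only a stand-in for the UPDATE.

```sql
-- Sketch: time the CTE part alone by replacing the UPDATE with a SELECT.
-- EXPLAIN ANALYZE really executes the statement, so a SELECT keeps it read-only.
EXPLAIN (ANALYZE, BUFFERS)
WITH first_names AS (
    SELECT row_number() OVER (ORDER BY random()) AS rn,
           first_name AS new_first_name
    FROM person
),
ids AS (
    SELECT row_number() OVER (ORDER BY random()) AS rn,
           id AS ref_id
    FROM person
)
SELECT ids.ref_id, first_names.new_first_name
FROM first_names
JOIN ids ON first_names.rn = ids.rn;
```

To measure the UPDATE itself, one option is to run EXPLAIN ANALYZE on it inside BEGIN ... ROLLBACK, since the statement is actually executed.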

Answer 1 (score: 4)

This one takes 5 seconds to shuffle 500,000 rows on my laptop:

with names as (
  select id, first_name, last_name,
         lead(first_name) over w as first_1,
         lag(first_name) over w as first_2
  from person
  window w as (order by random())
)
update person
  set first_name = coalesce(first_1, first_2)
from names 
where person.id = names.id;

The idea is to pick the "next" name after sorting the data randomly, which is just as good as picking a random name.
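To make the lead/lag mechanics concrete, here is a small sketch on three made-up rows (the names and the inline VALUES list are illustrative assumptions, not data from the question). It orders by id instead of random() so the result is deterministic.

```sql
-- Sketch: each row takes the next row's name; the last row has no "next",
-- so coalesce() falls back to the previous row's name via lag().
SELECT id,
       first_name,
       coalesce(lead(first_name) OVER w,
                lag(first_name)  OVER w) AS shuffled_name
FROM (VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid')) AS t(id, first_name)
WINDOW w AS (ORDER BY id)
ORDER BY id;
-- id | first_name | shuffled_name
--  1 | Ann        | Bob
--  2 | Bob        | Cid
--  3 | Cid        | Bob
```

Note that the result is not a strict permutation: the first name in the ordering ('Ann') is dropped and the second-to-last ('Bob') appears twice. On a large table this affects only one value per run, which is usually acceptable for anonymization.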

It is possible that not all names get shuffled, but if you run it two or three times, that should be good enough.

Here is a test setup on SQLFiddle: http://sqlfiddle.com/#!15/15713/1

The query on the right-hand side checks whether any names stayed unchanged after the "randomizing".
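The query from the fiddle is not reproduced here. A check of this kind could look roughly like the sketch below; it assumes a snapshot of the original values can be kept in a temporary table, and the name person_before is hypothetical.

```sql
-- Sketch, assuming a temporary snapshot is permitted (person_before is a made-up name).
CREATE TEMP TABLE person_before AS
SELECT id, first_name
FROM person;

-- ... run the shuffle UPDATE here ...

-- Count rows whose first_name is the same as before the shuffle.
SELECT count(*) AS unchanged
FROM person p
JOIN person_before b ON b.id = p.id
WHERE p.first_name IS NOT DISTINCT FROM b.first_name;
```

Rows can also count as unchanged simply because the same first name occurs many times in the table, so a non-zero result does not necessarily mean the shuffle failed.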