Bug in PostgreSQL group by on Unicode strings?

Time: 2013-04-05 04:32:39

Tags: unicode group-by greenplum

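I am seeing something very strange: when word is a UTF-8 string, group by (word) does not always group by word. Within a single query, some words are grouped correctly and others are not. Does anyone know what is going on?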

select *,count(*) over (partition by md5(word)) as k
from (
  select word,count(*) as n
  from :tmpwl
  group by 1
) a order by 1,2 limit 12;
/* gives:
 word | n | k 
------+---+---
 いい | 1 | 1
 くず | 1 | 1
 ごみ | 1 | 1
 さま | 1 | 1
 さん | 1 | 1
 へま | 1 | 1
 まめ | 1 | 1
 よく | 1 | 1
 ろく | 1 | 1
 ネガ | 1 | 2   -- what the heck?
 ネガ | 1 | 2
 パス | 1 | 1
*/
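Because k = 2 for the two ネガ rows, both rows have the same md5(word), which in practice means their bytes are identical. One way to narrow the problem down is to list only the words that group by split into more than one group, together with their character and byte lengths. This is a rough sketch rather than a verified diagnosis: the aliases h, chars, bytes and b are made up for illustration, and it assumes md5 collisions can be ignored.

```sql
-- Diagnostic sketch: show every word that "group by word" split into
-- more than one group even though md5(word) is the same.
select word,
       md5(word)          as h,      -- same hash: same bytes, barring collisions
       length(word)       as chars,  -- length in characters
       octet_length(word) as bytes,  -- length in bytes
       n
from (
  select *, count(*) over (partition by md5(word)) as k
  from (
    select word, count(*) as n
    from :tmpwl
    group by 1
  ) a
) b
where k > 1          -- keep only hashes that appear in several groups
order by h, word;
```

If every row returned for a given h also shows the same chars and bytes, the duplicate groups cannot be explained by differing input strings, which would point at the grouping step itself.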

Note that the following workaround works correctly:

select word,n,count(*) over (partition by md5(word)) as k
from (
  select md5(word),max(word) as word,count(*) as n
  from :tmpwl
  group by 1
) a order by 1,2 limit 12;
/* gives:
 word | n | k 
------+---+---
 いい | 1 | 1
 くず | 1 | 1
 ごみ | 1 | 1
 さま | 1 | 1
 さん | 1 | 1
 へま | 1 | 1
 まめ | 1 | 1
 よく | 1 | 1
 ろく | 1 | 1
 ネガ | 2 | 1
 パス | 1 | 1
 プア | 1 | 1
*/

The version is PostgreSQL 8.2.14 (Greenplum Database 4.0.4.0 build 3 Single-Node Edition) on x86_64-unknown-linux-gnu, compiled by GCC gcc.exe (GCC) 4.1.1 compiled on Nov 30 2010 17:20:26

The source table, :tmpwl:

\d :tmpwl
Table "pg_temp_25149.pdtmp_foo706453357357532"
  Column  |  Type   | Modifiers 
----------+---------+-----------
 baseword | text    | 
 word     | text    | 
 value    | integer | 
 lexicon  | text    | 
 nalts    | bigint  | 
Distributed by: (word)

0 answers:

No answers