GROUP BY groups values that are not equal

Asked: 2013-09-03 16:55:51

Tags: mysql group-by

Using MySQL 5.1.66-0+squeeze1-log on Debian, I am getting a GROUP BY result that I do not understand.

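If I GROUP BY data, unequal data values get merged, which makes no sense to me. If I GROUP BY a hash of the same column, SHA1(data), everything works fine and only rows with equal data values are grouped together.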

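What is going on here? It almost looks as if GROUP BY only considers the first x characters of the column. If that is not the case, why does this happen? Could it just be a twist in my own thinking?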

Edit 1: An example value of data (a JSON-encoded legacy from back when I was an even bigger noob ;)):

{"a":[{"val":{"tcn":{"1980":"1","1981":"1","1982":"1","1983":"1","1984":"1","1985":"1","1986":"1","1987":"1","1988":"1","1989":"1","1990":"1","1991":"1","1992":"1","1993":"1","1994":"1","1995":"1","1996":"1","1997":"1","1998":"1","1999":"1","2000":"1","2001":"1","2002":"1","2003":"1","2004":"1","2005":"1","2006":"1","2007":"1","2008":"1","2009":"1","2010":"1"},"sic":{"1980":"1","1981":"1","1982":"1","1983":"1","1984":"1","1985":"1","1986":"1","1987":"1","1988":"1","1989":"1","1990":"1","1991":"1","1992":"1","1993":"1","1994":"1","1995":"1","1996":"1","1997":"1","1998":"1","1999":"1","2000":"1","2001":"1","2002":"1","2003":"1","2004":"1","2005":"1","2006":"1","2007":"1","2008":"1","2009":"1","2010":"1"}}}],"b":[{"val":{"tcn":{"1980":"1","1981":"1","1982":"1","1983":"1","1984":"1","1985":"1","1986":"1","1987":"1","1988":"1","1989":"1","1990":"1","1991":"1","1992":"1","1993":"1","1994":"1","1995":"1","1996":"1","1997":"1","1998":"1","1999":"1","2000":"1","2001":"1","2002":"1","2003":"1","2004":"1","2005":"1","2006":"1","2007":"1","2008":"1","2009":"1","2010":"1"},"sic":{"1980":"1","1981":"1","1982":"1","1983":"1","1984":"1","1985":"1","1986":"1","1987":"1","1988":"1","1989":"1","1990":"1","1991":"1","1992":"1","1993":"1","1994":"1","1995":"1","1996":"1","1997":"1","1998":"1","1999":"1","2000":"1","2001":"1","2002":"1","2003":"1","2004":"1","2005":"1","2006":"1","2007":"1","2008":"1","2009":"1","2010":"1"}}}],"0":[{"val":{"com":{"able":"2"}},"str":{"com":{"comm":"According","src":{"1":{"name":"law 256","articles":"B2\/2.11","links":"","type":""},"2":{"name":"law 298","articles":"B.19\/2.3","links":"","type":""}}}}}]}

Edit 2: Sorry for leaving out the code; I thought that would make the question shorter and easier to read. Apparently it did the opposite...

SELECT
    GROUP_CONCAT(resid) AS ids
    ,data
FROM resdata
GROUP BY data

VS

SELECT
    GROUP_CONCAT(resid) AS ids
    ,CAST(SHA1(data) AS CHAR(40)) AS hash
    ,data
FROM resdata
GROUP BY hash

1 Answer:

Answer 0 (score: 1)

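I finally figured it out. The problem only occurs when GROUP_CONCAT() is present, as discussed in "GROUP_CONCAT() row count when grouping by a text field" (which I only found after working out that the issue was linked to the concat).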

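ORDER BY, DISTINCT and (indirectly) GROUP_CONCAT() all depend on the max_sort_length system variable. Any query that uses these operators or functions only considers the first max_sort_length bytes of a column, which in my case is the default of 1024 bytes.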
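To see which limit applies on a given server, the variable can be inspected and, if needed, raised for the current connection. This is only a minimal sketch: the value 8192 is an arbitrary assumption and would have to be chosen so that it exceeds the longest value being compared.

```sql
-- Show the limit currently in effect (1024 bytes unless it was changed).
SHOW VARIABLES LIKE 'max_sort_length';

-- Raise the limit for this connection only.
-- 8192 is an illustrative value, not a recommendation.
SET SESSION max_sort_length = 8192;
```

A larger limit may increase the memory needed for sorting, so grouping on a short hash of the column can be the cheaper option for very long values.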

Although GROUP BY itself does not use ORDER BY, GROUP_CONCAT() by default applies an ORDER BY on the columns used in the GROUP BY clause. (Thanks to Saharsh Shah, Jan 4 at 12:42.)

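Most of the values in my data column are much longer than max_sort_length. In my case there are 377 rows whose first 1024 bytes are identical but whose remainder differs. As a result, DISTINCT and GROUP BY return only 2360 rows, even though there are 2737 distinct values.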
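A quick way to gauge how exposed a table is to this effect is to count the values that are longer than the limit. The query below is a sketch that assumes the resdata table and data column from the question; LENGTH() returns bytes, which is the same unit max_sort_length uses.

```sql
SELECT
    COUNT(*) AS total_rows
    -- The comparison yields 1 or 0, so SUM counts the rows over the limit.
    ,SUM(LENGTH(data) > @@max_sort_length) AS rows_over_limit
    ,MAX(LENGTH(data)) AS longest_value_bytes
FROM resdata;
```

If rows_over_limit is zero, truncation cannot be the cause of merged groups in that table.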

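So be careful: grouping on a text column whose values are longer than max_sort_length may not give you the distinct results you are used to from INT and shorter CHAR columns. DISTINCT shows the same behavior, so using it to check the completeness of a GROUP BY will give you a false positive.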
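One way to check a GROUP BY without relying on DISTINCT over the raw column is to count distinct hashes instead, since a 40-character SHA1 digest is far shorter than max_sort_length. The sketch below again assumes the resdata table from the question and treats SHA1 collisions as negligible, which is an assumption.

```sql
-- Distinct values judged by a hash of the full column content.
SELECT COUNT(DISTINCT SHA1(data)) AS distinct_values
FROM resdata;

-- Number of groups produced when grouping on the raw column.
SELECT COUNT(*) AS groups_by_data
FROM (SELECT data FROM resdata GROUP BY data) AS g;
```

If the second number is smaller than the first, grouping on data is merging unequal values. Note that NULL values, if any, are ignored by COUNT(DISTINCT ...) but form one group of their own in the second query.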