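I have a SQL table in which I store large string values that must be unique. To ensure uniqueness, I have a unique index on a column in which I store the string representation of the MD5 hash of the large string.
The C# app that saves these records uses the following method to do the hashing: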
public static string CreateMd5HashString(byte[] input)
{
    var hashBytes = MD5.Create().ComputeHash(input);
    return string.Join("", hashBytes.Select(b => b.ToString("X")));
}
To call this function, I first convert the string to byte[] using the UTF-8 encoding:
// this is what I use in my app
CreateMd5HashString(Encoding.UTF8.GetBytes("abc"))
// result: 90150983CD24FB0D6963F7D28E17F72
Now I would like to implement this hashing in SQL using the HASHBYTES function, but I get a different value:
print hashbytes('md5', N'abc')
-- result: 0xCE1473CF80C6B3FDA8E3DFC006ADC315
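This is because SQL computes the MD5 of the UTF-16 representation of the string.
I get the same result in C# if I call CreateMd5HashString(Encoding.Unicode.GetBytes("abc")).
I cannot change the way hashing is done in the application.
Is there a way to get SQL Server to compute the MD5 hash of the UTF-8 bytes of the string?
I looked up similar questions and tried using collations, but have had no luck so far.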
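The mismatch described in the question can be reproduced with a small console program. This is only a sketch: the class name HashDemo and the helper CreateMd5HashStringPadded are hypothetical and not part of the original application. It also shows a detail worth keeping in mind when comparing values: the "X" format specifier does not zero-pad, so the string the app stores for "abc" is 31 characters long rather than the usual 32.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class HashDemo
{
    // Same logic as the method in the question: "X" prints each byte without
    // zero padding, so a byte such as 0x01 contributes one character, not two.
    public static string CreateMd5HashString(byte[] input)
    {
        using (var md5 = MD5.Create())
        {
            return string.Join("", md5.ComputeHash(input).Select(b => b.ToString("X")));
        }
    }

    // Hypothetical fixed-width variant ("X2"), shown only for comparison.
    public static string CreateMd5HashStringPadded(byte[] input)
    {
        using (var md5 = MD5.Create())
        {
            return string.Join("", md5.ComputeHash(input).Select(b => b.ToString("X2")));
        }
    }

    public static void Main()
    {
        byte[] utf8 = Encoding.UTF8.GetBytes("abc");     // 3 bytes: 61 62 63
        byte[] utf16 = Encoding.Unicode.GetBytes("abc"); // 6 bytes: 61 00 62 00 63 00

        Console.WriteLine(BitConverter.ToString(utf8));
        Console.WriteLine(BitConverter.ToString(utf16));
        Console.WriteLine(CreateMd5HashString(utf8));       // 90150983CD24FB0D6963F7D28E17F72
        Console.WriteLine(CreateMd5HashStringPadded(utf8)); // 900150983CD24FB0D6963F7D28E17F72
    }
}
```

Because the two encodings hand different byte sequences to MD5, the hashes cannot agree, whatever formatting is applied afterwards.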
Answer 0 (score: 8)
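You need to create a UDF that converts the NVARCHAR data to bytes in its UTF-8 representation. Suppose it is called dbo.NCharToUTF8Binary; then you can do:
hashbytes('md5', dbo.NCharToUTF8Binary(N'abc', 0))
Pass 0 as the second argument to get standard UTF-8, which is what Encoding.UTF8 produces; passing 1 selects modified UTF-8, which encodes U+0000 as the two bytes 0xC0 0x80.
Here is a UDF that does this: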
create function dbo.NCharToUTF8Binary(@txt NVARCHAR(max), @modified bit)
returns varbinary(max)
as
begin
    -- Note: This is not the fastest possible routine.
    -- If you want a fast routine, use SQLCLR
    set @modified = isnull(@modified, 0)
    -- First shred into a table, one row per UTF-16 code unit.
    declare @chars table (
        ix int identity primary key,
        codepoint int,
        utf8 varbinary(6)
    )
    declare @ix int
    set @ix = 0
    -- datalength() counts trailing spaces, which len() would ignore
    while @ix < datalength(@txt)/2
    begin
        set @ix = @ix + 1
        insert @chars(codepoint)
        select unicode(substring(@txt, @ix, 1))
    end
    -- Now look for surrogate pairs.
    -- If we find a pair (lead followed by trail) we will pair them
    -- High surrogate is \uD800 to \uDBFF
    -- Low surrogate is \uDC00 to \uDFFF
    -- Look for high surrogate followed by low surrogate and update the codepoint.
    -- Each half carries 10 payload bits, hence the 0x03ff mask and the 0x0400 multiplier.
    update c1 set codepoint = ((c1.codepoint & 0x03ff) * 0x0400) + (c2.codepoint & 0x03ff) + 0x10000
    from @chars c1 inner join @chars c2 on c1.ix = c2.ix -1
    where c1.codepoint >= 0xD800 and c1.codepoint <= 0xDBFF
    and c2.codepoint >= 0xDC00 and c2.codepoint <= 0xDFFF
    -- Get rid of the trailing half of the pair where found
    delete c2
    from @chars c1 inner join @chars c2 on c1.ix = c2.ix -1
    where c1.codepoint >= 0x10000
    -- Now we utf-8 encode each codepoint.
    -- Lone surrogate halves will still be here
    -- so they will be encoded as if they were not surrogate pairs.
    update c
    set utf8 =
        case
        -- One-byte encodings (modified UTF8 outputs zero as a two-byte encoding)
        when codepoint <= 0x7f and (@modified = 0 OR codepoint <> 0)
        then cast(substring(cast(codepoint as binary(4)), 4, 1) as varbinary(6))
        -- Two-byte encodings
        when codepoint <= 0x07ff
        then substring(cast((0x00C0 + ((codepoint/0x40) & 0x1f)) as binary(4)),4,1)
            + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1)
        -- Three-byte encodings
        when codepoint <= 0x0ffff
        then substring(cast((0x00E0 + ((codepoint/0x1000) & 0x0f)) as binary(4)),4,1)
            + substring(cast((0x0080 + ((codepoint/0x40) & 0x3f)) as binary(4)),4,1)
            + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1)
        -- Four-byte encodings
        when codepoint <= 0x1FFFFF
        then substring(cast((0x00F0 + ((codepoint/0x00040000) & 0x07)) as binary(4)),4,1)
            + substring(cast((0x0080 + ((codepoint/0x1000) & 0x3f)) as binary(4)),4,1)
            + substring(cast((0x0080 + ((codepoint/0x40) & 0x3f)) as binary(4)),4,1)
            + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1)
        end
    from @chars c
    -- Finally concatenate them all and return.
    declare @ret varbinary(max)
    set @ret = cast('' as varbinary(max))
    select @ret = @ret + utf8 from @chars c order by ix
    return @ret
end
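To make the encoding steps in the UDF easier to follow and to check, here is the same idea sketched in C#: pair surrogates into a code point, then emit one to four bytes depending on its size. The class and method names (Utf8Sketch, Encode) are hypothetical; this is an illustration of the technique, not a replacement for Encoding.UTF8.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class Utf8Sketch
{
    public static byte[] Encode(string text, bool modified = false)
    {
        var output = new List<byte>();
        for (int i = 0; i < text.Length; i++)
        {
            int cp = text[i];
            // Pair a high surrogate with the low surrogate that follows it.
            // Each half carries 10 payload bits.
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.Length
                && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF)
            {
                cp = ((cp & 0x03FF) << 10) + (text[i + 1] & 0x03FF) + 0x10000;
                i++; // skip the trailing half of the pair
            }

            if (cp <= 0x7F && (!modified || cp != 0))
            {
                // One byte (modified UTF-8 sends U+0000 to the two-byte branch)
                output.Add((byte)cp);
            }
            else if (cp <= 0x07FF)
            {
                output.Add((byte)(0xC0 | (cp >> 6)));
                output.Add((byte)(0x80 | (cp & 0x3F)));
            }
            else if (cp <= 0xFFFF)
            {
                output.Add((byte)(0xE0 | (cp >> 12)));
                output.Add((byte)(0x80 | ((cp >> 6) & 0x3F)));
                output.Add((byte)(0x80 | (cp & 0x3F)));
            }
            else
            {
                output.Add((byte)(0xF0 | (cp >> 18)));
                output.Add((byte)(0x80 | ((cp >> 12) & 0x3F)));
                output.Add((byte)(0x80 | ((cp >> 6) & 0x3F)));
                output.Add((byte)(0x80 | (cp & 0x3F)));
            }
        }
        return output.ToArray();
    }

    public static void Main()
    {
        // ASCII, three-byte characters, and a surrogate pair (U+1F600)
        foreach (var s in new[] { "abc", "中文", "\U0001F600" })
        {
            Console.WriteLine(Encode(s).SequenceEqual(Encoding.UTF8.GetBytes(s)));
        }
    }
}
```

For well-formed input the sketch agrees with Encoding.UTF8.GetBytes. The two differ on lone surrogate halves: like the UDF, the sketch encodes them as ordinary three-byte sequences, whereas Encoding.UTF8 substitutes the replacement character by default.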
Answer 1 (score: 1)
SQL Server does not natively support UTF-8 strings, and it hasn't for quite a while. As you noticed, NCHAR and NVARCHAR use UCS-2 rather than UTF-8.
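If you insist on using the HASHBYTES function, you must be able to pass the UTF-8 byte[] from your C# code as VARBINARY so that the encoding is preserved. HASHBYTES accepts VARBINARY in place of NVARCHAR. This could be done with a CLR function that accepts NVARCHAR and returns the result of Encoding.UTF8.GetBytes as VARBINARY.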
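A CLR function of the kind described above might look like the following. This is a minimal sketch under several assumptions: it targets the .NET Framework with a reference to System.Data, the class and method names (Utf8Functions, NVarCharToUtf8) are hypothetical, and the assembly registration and CREATE FUNCTION ... EXTERNAL NAME steps are not shown.

```csharp
using System.Data.SqlTypes;
using System.Text;
using Microsoft.SqlServer.Server;

public static class Utf8Functions
{
    // Hypothetical name; the answer does not specify one.
    // MaxSize = -1 asks for the (max) variants of NVARCHAR and VARBINARY.
    [SqlFunction(IsDeterministic = true, IsPrecise = true, DataAccess = DataAccessKind.None)]
    [return: SqlFacet(MaxSize = -1)]
    public static SqlBytes NVarCharToUtf8([SqlFacet(MaxSize = -1)] SqlString input)
    {
        // Propagate SQL NULL instead of throwing on input.Value.
        if (input.IsNull)
        {
            return SqlBytes.Null;
        }
        return new SqlBytes(Encoding.UTF8.GetBytes(input.Value));
    }
}
```

Once such a function is registered in the database under a name of your choosing, its result can be passed straight to HASHBYTES in the same way as the T-SQL UDF from the other answer.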
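That said, I strongly recommend keeping this kind of business rule in the application rather than in the database, especially since the application is already performing this logic.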
Answer 2 (score: 0)
SELECT HashBytes('MD5', CAST (N'中文' COLLATE Latin1_General_100_CI_AI_SC_UTF8 AS varchar(4000)))
This works only on SQL Server 2019.
Reference: https://www.mssqltips.com/sqlservertip/6168/impact-of-utf8-support-in-sql-server-2019/