Cleaning up an MSSQL SQL dump

Asked: 2015-09-25 16:16:45

Tags: sql-server regex encoding

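I'm supporting a migration from MSSQL to Postgres. I'm only the middleman and have no actual access to the MSSQL server, which doesn't make things any easier. I was able to convince the MSSQL people to do an SQL export (rather than a .bak file), and I already know how to deal with the MS-specific oddities in the file. However, the file also has a whole bunch of garbage characters at the end of every line, like this: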

 INSERT [dbo].[Client_Balances] ([client_id], [Total Invoices], [Total Debits], [Total Payments], [Total Credits], [Balance Forward]) VALUES (N'D0000006492', CAST(0.00 AS Decimal(38, 2)), CAST(0.00 AS Decimal(38, 2)), CAST(0.00 AS Decimal(38, 2)), CAST(0.00 AS Decimal(38, 2)), CAST(0.00 AS Decimal(38, 2)))਍䤀一匀䔀刀吀 嬀搀戀漀崀⸀嬀䌀氀椀攀渀琀开䈀愀氀愀渀挀攀猀崀 ⠀嬀挀氀椀攀渀琀开>椀搀崀Ⰰ 嬀吀漀琀愀氀 䤀渀瘀漀椀挀攀猀崀Ⰰ 嬀吀漀琀愀氀 䐀攀戀椀琀猀崀Ⰰ 嬀吀漀琀愀氀 倀愀礀洀攀渀琀猀崀Ⰰ 嬀吀漀琀愀氀 䌀爀攀搀椀琀猀崀Ⰰ 嬀䈀愀氀愀渀挀攀 䘀漀爀眀愀爀搀崀⤀ 嘀䄀䰀唀䔀匀 ⠀一✀䐀  
       ㄀㄀㐀㌀ ✀Ⰰ 䌀䄀匀吀⠀㈀㠀 ⸀   䄀匀 䐀攀挀椀洀愀氀⠀㌀㠀Ⰰ ㈀⤀⤀Ⰰ 䌀䄀匀吀⠀ ⸀   䄀匀 䐀攀挀椀洀愀氀⠀㌀㠀Ⰰ ㈀⤀⤀Ⰰ 䌀䄀匀吀⠀㌀㄀㐀⸀   䄀匀 䐀攀挀椀洀愀氀⠀㌀㠀Ⰰ ㈀⤀⤀Ⰰ 䌀䄀匀吀⠀ ⸀   䄀匀
     䐀攀挀椀洀愀氀⠀㌀㠀Ⰰ ㈀⤀⤀Ⰰ 䌀䄀匀吀⠀ⴀ㌀㐀⸀   䄀匀 䐀攀挀椀洀愀氀⠀㌀㠀Ⰰ ㈀⤀⤀⤀ഀ

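Any ideas on how to clean this up directly in the text .sql file? Doing it by hand is not an option: the file contains 13.1 million lines. The characters also aren't the same each time, so I can't tell whether I could write a regex that covers all of them.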

1 Answer:

Answer 0 (score: 2)

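If you open the file with latin1 or utf8 encoding, you can see that the "garbage" characters are actually misaligned INSERT statements. My guess is that Unix-style line endings (LF) were converted to Windows-style line endings (CRLF) without taking the 2-byte encoding into account: inserting a single CR byte shifts every following 2-byte character off by one byte.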

INSERT [dbo].[Client_Balances] ([client_id], [Total Invoices], [Total Debits], [Total Payments], [Total Credits], [Balance Forward]) VALUES (N'D0000006492', CAST(0.00 AS Decimal(38, 2)), CAST(0.00 AS Decimal(38, 2)), CAST(0.00 AS Decimal(38, 2)), CAST(0.00 AS Decimal(38, 2)), CAST(0.00 AS Decimal(38, 2)))
INSERT [dbo].[Client_Balances] ([client_id], [Total Invoices], [Total Debits], [Total Payments], [Total Credits], [Balance Forward]) VALUES (N'D0000011430', CAST(280.00 AS Decimal(38, 2)), CAST(0.00 AS Decimal(38, 2)), CAST(314.00 AS Decimal(38, 2)), CAST(0.00 AS Decimal(38, 2)), CAST(-34.00 AS Decimal(38, 2)))
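
The misalignment can be reproduced on a toy string. The following is a minimal sketch, assuming the dump was UTF-16LE with LF line endings and that the conversion tool replaced every 0A byte with 0D 0A; the sample text and variable names are made up for illustration.

```python
# Toy stand-in for the dump: two statements, Unix line endings.
original = "INSERT a\nINSERT b\n"

# UTF-16LE stores each of these characters as two bytes; "\n" becomes 0A 00.
utf16_bytes = original.encode("utf-16-le")

# Assumed damage: a byte-level tool turns every 0A byte into 0D 0A.
mangled = utf16_bytes.replace(b"\n", b"\r\n")

# Read the damaged bytes as UTF-16LE: the first line survives,
# the second is decoded one byte off and comes out as unrelated characters.
garbled = mangled.decode("utf-16-le", errors="replace")
print(garbled)
```

In this toy input the second line decodes to a run starting with U+0A0D, U+4900, U+4E00, which is the same ਍䤀一 pattern that opens the garbage in the question. Each extra byte flips the alignment, so every second line is affected, which matches the alternation of readable and unreadable statements in the dump.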

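You can try to reverse this by opening the file as latin1 in Windows (dos) file format and then saving it in Unix format. Then reopen the file with utf16 encoding. (Replacing CRLF with LF should also work.)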

In vim:

:e ++enc=latin1 ++ff=dos data.sql
:w ++enc=latin1 ++ff=unix data2.sql
:e ++enc=utf16le data2.sql

(or try utf16be).
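
For a file of 13.1 million lines, the same reversal can also be done as a streaming script instead of loading everything into an editor. This is a rough sketch, not a tested migration tool: `repair_dump`, its parameters, and the UTF-8 output are assumptions for illustration, and it only inverts the damage exactly if the tool replaced every 0A byte with 0D 0A.

```python
import codecs


def repair_dump(src_path, dst_path, encoding="utf-16-le", chunk_size=1 << 20):
    """Undo a byte-level LF -> CRLF conversion, then re-encode as UTF-8.

    Hypothetical helper: pass encoding="utf-16-be" if little-endian
    still produces garbage.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    carry = b""      # a trailing CR byte held back until the next chunk arrives
    at_start = True  # used to drop a byte order mark, if the file has one
    with open(src_path, "rb") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            data = carry + chunk
            carry = b""
            # A CR at the end of a chunk may be half of a CR LF pair.
            if data.endswith(b"\r"):
                data, carry = data[:-1], b"\r"
            # Byte-level CRLF -> LF, the same thing the vim round trip does.
            text = decoder.decode(data.replace(b"\r\n", b"\n"))
            if at_start and text:
                if text.startswith("\ufeff"):
                    text = text[1:]
                at_start = False
            dst.write(text)
        # Flush whatever is left (normally nothing).
        dst.write(decoder.decode(carry, final=True))
```

Usage would look like `repair_dump("data.sql", "data2.sql")`. Working on raw bytes first and decoding afterwards mirrors the latin1 trick in vim: latin1 maps every byte to exactly one character, so the line-ending fix happens before any 2-byte interpretation.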