Only when ASCII values

Time: 2017-01-19 16:13:46

Tags: python utf-8

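I am trying to write a program that lets me compare SQL files with one another, and I started by writing the complete SQL files into a text file. The text file is generated successfully, but block characters appear at the end of each line, as in the following example: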

SET ANSI_NULLS ON਍ഀ
GO਍ഀ
SET QUOTED_IDENTIFIER ON਍ഀ 
GO਍ഀ
CREATE TABLE [dbo].[CDR](਍ഀ

Here is the code that generates the text file:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import os
from _ast import Num 

#imports packages
r= open('master_lines.txt', 'w')

directory= "E:\\" #file directory, anonymous omission
master= directory + "master" 
databases= ["\\1", "\\2", "\\3", "\\4"]
file_types= ["\\StoredProcedure", "\\Table", "\\UserDefinedFunction", "\\View"]
servers= []
server_number= []
master_lines= []

for file in os.listdir("E:\\"):     #adds server paths to an array   
    servers.append(file)

for num in range(0, len(servers)):
    for file in os.listdir(directory + servers[num]):      #adds all the servers and paths to an array 
        server_number.append(servers[num] + "\\" + file)

master= directory + server_number[server_number.index("master")]

master_var= master + databases[0]

tmp= master_var + file_types[1]
for file in os.listdir(tmp):
    with open(file) as tmp_file:
        line= tmp_file.readlines()
    for num in range(0, len(line)):
        r.write(line[num])                      

r.close

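I have tried changing the encoding to latin1 and utf-8; the current text file is the most successful attempt, since ascii and latin1 produced Chinese and Arabic characters respectively.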

Here is the SQL file as text:

/****** Object:  Table [dbo].[CDR]    Script Date: 2017-01-12 02:30:49 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[CDR](
    [calldate] [datetime] NOT NULL,
    [clid] [varchar](80) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [src] [varchar](80) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [dst] [varchar](80) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [dcontext] [varchar](80) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [channel] [varchar](80) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [dstchannel] [varchar](80) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [lastapp] [varchar](80) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [lastdata] [varchar](80) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [duration] [int] NOT NULL,
    [billsec] [int] NOT NULL,
    [disposition] [varchar](45) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [amaflags] [int] NOT NULL,
    [accountcode] [varchar](20) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [userfield] [varchar](255) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [uniqueid] [varchar](64) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    [cdr_id] [int] NOT NULL,
    [cost] [real] NOT NULL,
    [cdr_tag] [varchar](10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [importID] [bigint] IDENTITY(-9223372036854775807,1) NOT NULL,
 CONSTRAINT [PK_CDR_1] PRIMARY KEY CLUSTERED 
(
    [uniqueid] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [ReadPartition]
) ON [ReadPartition]

GO
SET ANSI_PADDING ON

GO
/****** Object:  Index [Idx_Dst_incl_uniqueId]    Script Date: 2017-01-12 02:30:50 PM ******/
CREATE NONCLUSTERED INDEX [Idx_Dst_incl_uniqueId] ON [dbo].[CDR]
(
    [dst] ASC
)
INCLUDE (   [uniqueid]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [ReadPartition]
GO

A hex dump of the file, to see what is going on (not part of the question above):

ff fe 2f 00 2a 00 2a 00 2a 00 2a 00 2a 00 2a 00 
20 00 4f 00 62 00 6a 00 65 00 63 00 74 00 3a 00 
20 00 20 00 54 00 61 00 62 00 6c 00 65 00 20 00 
5b 00 64 00 62 00 6f 00 5d 00 2e 00 5b 00 43 00 
44 00 52 00 5d 00 20 00 20 00 20 00 20 00 53 00 
63 00 72 00 69 00 70 00 74 00 20 00 44 00 61 00 
74 00 65 00 3a 00 20 00 32 00 30 00 31 00 37 00 
2d 00 30 00 31 00 2d 00 31 00 32 00 20 00 30 00 
32 00 3a 00 33 00 30 00 3a 00 34 00 39 00 20 00 
50 00 4d 00 20 00 2a 00 2a 00 2a 00 2a 00 2a 00 
2a 00 2f 00 0d 00 0a 00 53 00 45 00 54 00 20 00 
41 00 4e 00 53 00 49 00 5f 00 4e 00 55 00 4c 00 
4c 00 53 00 20 00 4f 00 4e 00 0d 00 0a 00 47 00 
4f 00 0d 00 0a 00 53 00 45 00 54 00 20 00 51 00 
55 00 4f 00 54 00 45 00 44 00 5f 00 49 00 44 00 

Text column of the hexdump:

../.*.*.*.*.*.*.
.O.b.j.e.c.t.:.
. .T.a.b.l.e. .
[.d.b.o.]...[.C.
D.R.]. . . . .S.
c.r.i.p.t. .D.a.
t.e.:. .2.0.1.7.
-.0.1.-.1.2. .0.
2.:.3.0.:.4.9. .
P.M. .*.*.*.*.*.
*./.....S.E.T. .
A.N.S.I._.N.U.L.
L.S. .O.N.....G.
O.....S.E.T. .Q.
U.O.T.E.D._.I.D.

1 answer:

Answer 0: (score: 1)

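Your problem is that the original file is encoded in UTF-16 with an initial byte order mark (BOM). On Windows this is usually transparent, because almost every file editor detects the encoding automatically from the initial BOM.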
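
One way to check this for a given file is to compare its first bytes against the known BOM constants. The sketch below is only an illustration: `sniff_bom` is a hypothetical helper name, and the sample bytes are the first six bytes of the hex dump in the question.

```python
import codecs

# Known byte order marks. The UTF-32 marks come first so that the
# UTF-32 LE mark (ff fe 00 00) is not mistaken for UTF-16 LE (ff fe).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return the encoding suggested by a leading BOM, or None (hypothetical helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


# First bytes of the hex dump from the question: ff fe 2f 00 2a 00
head = bytes.fromhex("fffe2f002a00")
print(sniff_bom(head))
```

For a file on disk, the same check could be applied to the first few bytes read in binary mode (`open(path, "rb").read(4)`).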

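But in a Python script the conversion is not automatic! That means each character is read as the character itself followed by a null. This is almost transparent except at line endings, because the nulls are simply written back and form normal UTF-16 characters again. The \n, however, is no longer directly preceded by the original \r but by a null, and because you write in text mode, Python expands the line ending into \r\n pairs. The extra bytes no longer form valid UTF-16 characters, and that is what causes the block characters to be displayed.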
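
The effect can be reproduced in memory. This is a minimal sketch, assuming the script ran on Python 3 under Windows with an 8-bit default codec (cp1252 here is an assumption) and default newline handling; `simulate_roundtrip` is a name made up for the illustration.

```python
import io


def simulate_roundtrip(raw, codec="cp1252"):
    """Read bytes as 8-bit text and write them back with Windows newlines."""
    # Reading: newline=None enables universal newlines, so a lone '\r'
    # and a lone '\n' are each turned into '\n'.
    reader = io.TextIOWrapper(io.BytesIO(raw), encoding=codec, newline=None)
    text = reader.read()

    # Writing: newline="\r\n" mimics the Windows default, where every
    # '\n' is written out as '\r\n'.
    sink = io.BytesIO()
    writer = io.TextIOWrapper(sink, encoding=codec, newline="\r\n")
    writer.write(text)
    writer.flush()
    return sink.getvalue()


# A UTF-16 LE file with a BOM and one Windows line ending.
raw = b"\xff\xfe" + "GO\r\nSET".encode("utf-16-le")
mangled = simulate_roundtrip(raw)

print(raw.hex(" "))      # bytes before the round trip
print(mangled.hex(" "))  # bytes after the round trip
print(ascii(mangled.decode("utf-16")))
```

Under these assumptions the four bytes `0d 00 0a 00` come back as `0d 0a 00 0d 0a 00`, which a UTF-16 reader decodes as U+0A0D, U+0D00 and a newline: the same pair of characters that ends each line in the question's output.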

The fix is simple: just declare the UTF-16 encoding when reading the file:

for file in os.listdir(tmp):
    with open(file, encoding='utf_16_le') as tmp_file:
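
One detail worth checking with this codec name: in Python, `utf_16_le` does not treat a leading BOM specially and returns it as the character U+FEFF, whereas the generic `utf_16` codec consumes the BOM and uses it to pick the byte order. A small sketch on in-memory bytes (the sample text is made up for the illustration):

```python
# UTF-16 LE bytes preceded by the ff fe byte order mark.
sample = b"\xff\xfe" + "SET ANSI_NULLS ON".encode("utf-16-le")

with_le = sample.decode("utf_16_le")    # BOM is kept as U+FEFF
with_generic = sample.decode("utf_16")  # BOM is consumed

print(ascii(with_le))
print(ascii(with_generic))
```

If the leading U+FEFF is unwanted in the combined file, `encoding='utf_16'` may be the more convenient choice for files that start with a BOM.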

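Alternatively, if you want to keep the UTF-16 encoding, you can also open the master file with it. Otherwise Python writes the output in its default encoding, which depends on the platform and is not necessarily utf8 on Windows, so it is safer to pass the encoding explicitly. My advice, though, is to go back to an 8-bit encoded file, to avoid further problems when you later want to process the output file.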
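
Putting the advice together, the loop might look like the following. This is a sketch under assumptions rather than the asker's final script: `merge_sql_files` and its parameters are hypothetical names, the `.sql` filter is a guess about the directory contents, `utf-16` is used so that the BOM is consumed on reading, and UTF-8 stands in for the 8-bit output encoding.

```python
import os


def merge_sql_files(src_dir, out_path, in_encoding="utf-16", out_encoding="utf-8"):
    """Concatenate every .sql file in src_dir into one text file."""
    # newline="" on both sides disables newline translation, so the
    # original \r\n line endings are copied through unchanged.
    with open(out_path, "w", encoding=out_encoding, newline="") as out:
        for name in sorted(os.listdir(src_dir)):
            if not name.lower().endswith(".sql"):
                continue
            # os.listdir returns bare names, so join them with the directory.
            path = os.path.join(src_dir, name)
            with open(path, encoding=in_encoding, newline="") as src:
                out.write(src.read())
```

With the variables from the question, a call could look like `merge_sql_files(tmp, 'master_lines.txt')`. The `with` blocks also close both files, which the bare `r.close` in the question (no parentheses) does not do.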