I am trying to parse every line of a database file to get it ready for an import. It has fixed-width lines, but the widths are in characters, not in bytes. I coded something based on Martineau's answer, but I am running into trouble with special characters.
Sometimes they break the expected widths, and sometimes they raise a UnicodeDecodeError. I believe the decode error can be fixed, but can I keep using struct.unpack like this
and still decode the special characters correctly? I think the problem is that they are encoded as multiple bytes, which messes up the expected field widths, since those widths are interpreted as bytes rather than characters.
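That suspicion is easy to check; a minimal sketch (my own illustration, not part of the script below) comparing character and byte lengths of one of the sample fields:

# "ô" is one character but two bytes in UTF-8, so every byte-based
# field boundary that follows it shifts by one position.
field = "rôn. 2x 17/220V"
print(len(field))                  # 15 characters
print(len(field.encode("utf-8")))  # 16 bytes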
import os, csv

def ParseLine(arquivo):
    import struct, string
    format = "1x 12s 1x 18s 1x 16s"
    expand = struct.Struct(format).unpack_from
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
    for line in arquivo:
        fields = unpack(line)
        yield [x.strip() for x in fields]

Caminho = r"C:\Sample"
os.chdir(Caminho)
with open("Sample data.txt", 'r') as arq:
    with open("Out" + ".csv", "w", newline='') as sai:
        Write = csv.writer(sai, delimiter=";", quoting=csv.QUOTE_MINIMAL).writerows
        for line in ParseLine(arq):
            Write([line])
Sample data:
| field 1| field 2 | field 3 |
| sreaodrsa | raesodaso t.thl o| .tdosadot. osa |
| resaodra | rôn. 2x 17/220V | sreao.tttra v |
| esarod sê | raesodaso t.thl o| .tdosadot. osa |
| esarod sa í| raesodaso t.thl o| .tdosadot. osa |
Actual output:
field 1;field 2;field 3
sreaodrsa;raesodaso t.thl o;.tdosadot. osa
resaodra;rôn. 2x 17/22;V | sreao.tttra
In the output we can see that lines 1 and 2 come out as expected. Line 3 has the wrong widths, probably because of the multi-byte ô.
Line 4 raises the following exception:
Traceback (most recent call last):
File "C:\Sample\FindSample.py", line 18, in <module>
for line in ParseLine(arq):
File "C:\Sample\FindSample.py", line 9, in ParseLine
fields = unpack(line)
File "C:\Sample\FindSample.py", line 7, in <lambda>
unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
File "C:\Sample\FindSample.py", line 7, in <genexpr>
unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 11: unexpected end of data
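The traceback is consistent with that explanation: struct cuts the encoded line at fixed byte offsets, and when a cut lands inside a multi-byte sequence, the truncated field can no longer be decoded. A minimal reproduction of the mechanism (illustrative values, not my exact data):

# Cutting a UTF-8 byte string in the middle of a multi-byte character
# leaves a dangling lead byte, and decode() reports "unexpected end of data".
data = "esarod sa í".encode("utf-8")  # "í" encodes to two bytes
chunk = data[:11]                      # a byte-based cut lands inside "í"
chunk.decode("utf-8")                  # raises UnicodeDecodeError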
I need to perform specific operations on each field, so I cannot use re.sub over the whole file as I did before.
I would like to keep this code, since it seems efficient and is on the verge of working, but if there is a more efficient way to parse I could give it a try. I do need to keep the special characters.
Answer 0 (score: 0)
Indeed, the struct approach falls short here, because it expects fields to be a fixed number of bytes wide, while your format uses a fixed number of code points.
I would not use struct here at all. Your lines are already decoded to Unicode values, so just use slicing to extract the data:
def ParseLine(arquivo):
    slices = [slice(1, 13), slice(14, 32), slice(33, 49)]
    for line in arquivo:
        yield [line[s].strip() for s in slices]
This works purely on the characters of the already decoded lines, never on bytes. If you have field widths rather than indices, you can also generate the slice()
objects:
def widths_to_slices(widths):
    pos = 0
    for width in widths:
        pos += 1  # delimiter
        yield slice(pos, pos + width)
        pos += width

def ParseLine(arquivo):
    widths = (12, 18, 16)
    for line in arquivo:
        yield [line[s].strip() for s in widths_to_slices(widths)]
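As a quick sanity check (my addition, not part of the original answer), the generated slices match the hard-coded ones used above:

>>> list(widths_to_slices((12, 18, 16)))
[slice(1, 13, None), slice(14, 32, None), slice(33, 49, None)]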
Demo:
>>> sample = '''\
... | field 1| field 2 | field 3 |
... | sreaodrsa | raesodaso t.thl o| .tdosadot. osa |
... | resaodra | rôn. 2x 17/220V | sreao.tttra v |
... | esarod sê | raesodaso t.thl o| .tdosadot. osa |
... | esarod sa í| raesodaso t.thl o| .tdosadot. osa |
... '''.splitlines()
>>> def ParseLine(arquivo):
...     slices = [slice(1, 13), slice(14, 32), slice(33, 49)]
...     for line in arquivo:
...         yield [line[s].strip() for s in slices]
...
>>> for line in ParseLine(sample):
...     print(line)
...
['field 1', 'field 2', 'field 3']
['sreaodrsa', 'raesodaso t.thl o', '.tdosadot. osa']
['resaodra', 'rôn. 2x 17/220V', 'sreao.tttra v']
['esarod sê', 'raesodaso t.thl o', '.tdosadot. osa']
['esarod sa í', 'raesodaso t.thl o', '.tdosadot. osa']
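To tie this back to the original script: the slice-based ParseLine drops straight into the CSV-writing part of the question. A sketch under the assumption that the input file is UTF-8 (swap the encoding argument if it is something else), writing one row per record:

import csv

def ParseLine(arquivo):
    slices = [slice(1, 13), slice(14, 32), slice(33, 49)]
    for line in arquivo:
        yield [line[s].strip() for s in slices]

# Same writer settings as in the question; only the parsing changed.
with open("Sample data.txt", "r", encoding="utf-8") as arq, \
     open("Out.csv", "w", newline="", encoding="utf-8") as sai:
    writer = csv.writer(sai, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    for fields in ParseLine(arq):
        writer.writerow(fields)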