Reading a file line by line with stray \r\n characters

Date: 2017-03-28 03:52:57

Tags: python python-3.x

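I process text files generated by someone else. The lines in these files are separated by 0xA characters, but occasionally a stray 0xD shows up in the middle of a line. This is how I read the file: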

for i, line in enumerate(open(file_path, "r", newline=chr(10))):
   ...

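It looks like even though I tell open to use 0xA as the line separator, it still gets confused by the stray 0xD and ends up parsing incomplete lines. What am I missing?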

(Processing is done on Windows.)

2 Answers:

Answer 0: (score: 1)

It seems to work as expected (Python 3.5):

>>> f = open('test.txt', 'wb') # write in binary mode so nothing is changed
>>> f.write('both\r\nnewline\ncarriagereturn\rbothagain\r\n'.encode('utf-8'))
40    
>>> f.close()

>>> open('test.txt', 'rb').read() # confirm data is intact
b'both\r\nnewline\ncarriagereturn\rbothagain\r\n'

>>> list(open('test.txt', 'r', newline=None)) # universal mode (convert everything to '\n')
['both\n', 'newline\n', 'carriagereturn\n', 'bothagain\n']

>>> list(open('test.txt', 'r', newline='')) # universal mode but leave data unchanged
['both\r\n', 'newline\n', 'carriagereturn\r', 'bothagain\r\n']

>>> list(open('test.txt', 'r', newline='\n')) # split only on '\n'
['both\r\n', 'newline\n', 'carriagereturn\rbothagain\r\n']

>>> list(open('test.txt', 'r', newline='\r')) # split only on '\r'
['both\r', '\nnewline\ncarriagereturn\r', 'bothagain\r', '\n']

>>> list(open('test.txt', 'r', newline='\r\n')) # split only on '\r\n'
['both\r\n', 'newline\ncarriagereturn\rbothagain\r\n']

Can you post some sample data? And code that reproduces the problem?
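
One way to check this against the question's own call is to print repr(line) instead of the line itself. This is only a sketch under assumptions: the sample bytes are made up, and the guess is that a stray 0xD might only make the printed output look truncated, since many consoles move the cursor back to the start of the line when they see a carriage return.

```python
import os
import tempfile

# Hypothetical sample data: lines end in 0xA, with one stray 0xD mid-line.
data = b"first line\nsecond\rline\nthird line\n"

# Write the bytes unchanged to a temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(data)
    file_path = tmp.name

try:
    # Same call as in the question: split only on '\n', no translation.
    with open(file_path, "r", newline=chr(10)) as f:
        lines = list(f)
finally:
    os.remove(file_path)

for i, line in enumerate(lines):
    # repr() shows '\r' explicitly instead of letting the console act on it.
    marker = "  <- contains stray CR" if "\r" in line else ""
    print(i, repr(line) + marker)
```

If the second line comes back as a single string with '\r' inside it, open is splitting correctly and the problem is more likely in how the line is displayed or processed afterwards.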

Answer 1: (score: 0)

Can you split the lines manually?

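# newline="" turns off newline translation; with the default, a stray '\r'
# would be converted to '\n' on read and still split the line.
for i, line in enumerate(open(file_path, "r", newline="").read().split('\n')):
    ...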
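
A slightly fuller sketch of manual splitting, in case it helps. The helper name split_lf_only and its drop_stray_cr flag are made up for illustration, and the text is assumed to have been read without newline translation, for example with open(file_path, "r", newline="").read(). Whether a stray '\r' should be deleted, replaced, or kept depends on the data, so dropping it here is just one possible choice.

```python
def split_lf_only(text, drop_stray_cr=False):
    """Split text on '\n' only; optionally remove '\r' characters.

    Hypothetical helper, not part of the standard library.
    """
    lines = text.split("\n")
    # A trailing '\n' leaves an empty string at the end; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    if drop_stray_cr:
        lines = [line.replace("\r", "") for line in lines]
    return lines


# Made-up sample: '\n' line endings with one stray '\r' mid-line.
text = "first line\nsecond\rline\nthird line\n"

print(split_lf_only(text))
# ['first line', 'second\rline', 'third line']
print(split_lf_only(text, drop_stray_cr=True))
# ['first line', 'secondline', 'third line']
```

Note that this reads the whole file into memory, unlike iterating over the file object, so it may not suit very large files.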