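I am trying to detect standalone line feeds in a file using Python. The file contains some standalone "LF" characters (i.e. \n) and some "CRLF" pairs (i.e. \r\n), and I want to match only the standalone ones.
I thought this would work:
match = re.search('(?<!\r)\n', line)
where line is a string read from the file in a loop. However, the negative lookbehind does not seem to work.
Here is the full script for context: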
import sys
import fileinput
import os
import os.path
import re
# Descriptions: iterates over files in source directory, removes whitespace characters and saves to destination directory.
print ('Source Directory:', str(sys.argv[1]))
print ('Destination Directory:', str(sys.argv[2]))
for i in os.listdir(sys.argv[1]):
    fullSource = (os.path.join(sys.argv[1], i))
    fullDestination = (os.path.join(sys.argv[2], i))
    newfile = open(fullDestination, "x")
    for line in fileinput.input(fullSource):
        matchObj = re.search('(?<!\r)\n', line)
        if matchObj:
            newfile.write(line.rstrip('\r\n'))
        else:
            newfile.write(line)
    newfile.close
    print ("created " + fullDestination)
The result is that all line endings (both the standalone ones and the CRLF pairs) are removed. Am I missing something?
Answer 0 (score: 1)
Well, this result is no surprise. The fileinput module opens files in text mode by default, so each \r\n is automatically translated to a single \n. The regex therefore matches every line and the rstrip call removes every \n; the \r characters have already been removed by fileinput.
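A quick way to see this translation is to write a small file with mixed line endings and read it back in both modes. This is only a minimal sketch; the temporary file and the `sample` bytes are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample data: one CRLF-terminated line and one LF-terminated line.
sample = b"first\r\nsecond\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Default text mode: universal newlines turn "\r\n" into "\n".
    with open(path, "r") as f:
        text_lines = f.readlines()

    # Binary mode: the bytes come back untouched.
    with open(path, "rb") as f:
        binary_lines = f.readlines()
finally:
    os.remove(path)

print(text_lines)    # ['first\n', 'second\n']
print(binary_lines)  # [b'first\r\n', b'second\n']
```

In text mode both lines look identical, which is why the lookbehind can never fail there.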
So you must explicitly use a binary open mode. Unfortunately, if you use Python 3.x (which your print syntax suggests), binary mode gives you bytes that you need to decode to strings. Your code could become:
import sys
import fileinput
import os
import os.path
import re
# Descriptions: iterates over files in source directory, removes whitespace characters and saves to destination directory.
print ('Source Directory:', str(sys.argv[1]))
print ('Destination Directory:', str(sys.argv[2]))
for i in os.listdir(sys.argv[1]):
    fullSource = (os.path.join(sys.argv[1], i))
    fullDestination = (os.path.join(sys.argv[2], i))
    newfile = open(fullDestination, "x")
    for line in fileinput.input(fullSource, mode='rb'):  # explicit binary mode
        line = line.decode('latin1')  # convert bytes to str in Python 3
        matchObj = re.search('(?<!\r)\n', line)
        if matchObj:
            newfile.write(line.rstrip('\r\n'))
        else:
            newfile.write(line)
    newfile.close()  # the parentheses are needed to actually close the file
    print ("created " + fullDestination)
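If decoding is a concern, one alternative (not part of the code above) is to stay in bytes for the whole round trip and use a bytes pattern. The sketch below assumes the goal is to drop standalone LFs while keeping CRLF pairs; the function name `strip_lone_lf` is hypothetical.

```python
import re

# Bytes pattern: an LF that is not immediately preceded by a CR.
LONE_LF = re.compile(rb"(?<!\r)\n")

def strip_lone_lf(data: bytes) -> bytes:
    """Remove standalone LF bytes, leaving CRLF pairs untouched."""
    return LONE_LF.sub(b"", data)

print(strip_lone_lf(b"a\nb\r\nc\n"))  # b'ab\r\nc'
```

Reading the source with open(path, 'rb') and writing the destination with open(path, 'xb') would keep every other byte exactly as it was, so no encoding has to be guessed.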
Answer 1 (score: 0)
Your regex correctly matches \n characters that are not preceded by \r:
>>> re.search('(?<!\r)\n', 'abc\r')
>>> re.search('(?<!\r)\n', 'abc\r\n')
>>> re.search('(?<!\r)\n', 'abc\n')
<_sre.SRE_Match object; span=(3, 4), match='\n'>
Your if and write are wrong:
if matchObj:  # "If line ends with '\n'"
    # rstrip('\r\n') treats its argument as a set of characters, so this strips
    # the trailing '\n' (and any '\r') and writes the line with no terminator.
    newfile.write(line.rstrip('\r\n'))
else:
    newfile.write(line)
You probably want to do something like this:
if not matchObj:  # "If line ends with '\r\n'"
    # Note that strip('\r\n') removes these two characters, but does not add '\n' back.
    newfile.write(line.replace('\r\n', '\n'))
else:
    newfile.write(line)
By the way, you do not need a regex for what you want to do; endswith() is enough:
if line.endswith('\r\n'):
    newfile.write(line.replace('\r\n', '\n'))
else:
    newfile.write(line)
In fact, replace() alone is enough:
newfile.write(line.replace('\r\n', '\n'))
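Note that replace('\r\n', '\n') only has something to replace if the \r\n pairs survive reading. One way to keep them in text mode, sketched here as an assumption rather than as part of the answer above, is to pass newline='' to open(), which disables newline translation on both reading and writing. The function name `normalize_newlines` and the file names are hypothetical.

```python
import os
import tempfile

def normalize_newlines(src: str, dst: str) -> None:
    """Copy src to dst, converting every CRLF to a single LF."""
    # newline='' turns off newline translation, so '\r\n' is read as-is
    # and '\n' is written as-is on every platform.
    with open(src, "r", newline="") as fin, open(dst, "x", newline="") as fout:
        for line in fin:
            fout.write(line.replace("\r\n", "\n"))

# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmpdir:
    src = os.path.join(tmpdir, "in.txt")
    dst = os.path.join(tmpdir, "out.txt")
    with open(src, "wb") as f:
        f.write(b"one\r\ntwo\nthree\r\n")
    normalize_newlines(src, dst)
    with open(dst, "rb") as f:
        result = f.read()

print(result)  # b'one\ntwo\nthree\n'
```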