Question

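I'm trying to detect standalone line feeds in a file using Python. The file contains some standalone "LF" characters (i.e. \n) and some "CRLF" sequences (i.e. \r\n), and I want to match only the standalone ones.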

I thought this would work:

match = re.search('(?<!\r)\n', line)

where line is a string from the file being looped over. However, the negative lookbehind doesn't seem to work.

Here is the full script for context:

import sys
import fileinput
import os
import os.path
import re

# Descriptions: iterates over files in source directory, removes whitespace characters and saves to destination directory.


print ('Source Directory:', str(sys.argv[1]))
print ('Destination Directory:', str(sys.argv[2]))

for i in os.listdir(sys.argv[1]):
    fullSource = (os.path.join(sys.argv[1], i))
    fullDestination = (os.path.join(sys.argv[2], i))
    newfile = open(fullDestination, "x")
    for line in fileinput.input(fullSource):
        matchObj = re.search('(?<!\r)\n', line)
        if matchObj:
            newfile.write(line.rstrip('\r\n'))
        else:
            newfile.write(line)
    newfile.close
    print ("created " + fullDestination)

The result is that all line endings are stripped (standalone LF and CRLF alike). Am I missing something?

Answer 1

Well, this result is no surprise. The fileinput module opens files in text mode by default, so every \r\n is automatically translated to a single \n. The regex therefore matches every line and the script strips every \n; the \r characters have already been removed by fileinput.
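To see the translation for yourself, here is a small sketch (the file name and sample data are made up for illustration) that writes one CRLF line and one LF line, then reads them back three ways. It uses plain open() rather than fileinput, on the assumption that fileinput's default text mode behaves like open() in 'r' mode.

```python
import os
import tempfile

# Sample data: one CRLF-terminated line followed by one LF-terminated line.
data = b"first\r\nsecond\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(data)

    # Default text mode: universal newlines translate "\r\n" to "\n".
    with open(path, "r", encoding="latin1") as f:
        text_lines = f.readlines()

    # newline="" keeps the line endings exactly as stored in the file.
    with open(path, "r", encoding="latin1", newline="") as f:
        raw_lines = f.readlines()

    # Binary mode returns bytes and never translates anything.
    with open(path, "rb") as f:
        byte_lines = f.readlines()

print(text_lines)  # ['first\n', 'second\n']
print(raw_lines)   # ['first\r\n', 'second\n']
print(byte_lines)  # [b'first\r\n', b'second\n']
```

Only the last two variants still let a lookbehind tell \r\n apart from a lone \n.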

So you must explicitly use a binary open mode. Unfortunately, if you use Python 3.x (which your print syntax suggests), binary mode gives you bytes that you need to decode to strings. Your code could become:

import sys
import fileinput
import os
import os.path
import re

# Descriptions: iterates over files in source directory, removes whitespace characters and saves to destination directory.


print ('Source Directory:', str(sys.argv[1]))
print ('Destination Directory:', str(sys.argv[2]))

for i in os.listdir(sys.argv[1]):
    fullSource = (os.path.join(sys.argv[1], i))
    fullDestination = (os.path.join(sys.argv[2], i))
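    # newline='' stops Python from translating '\n' on write; latin1 mirrors the decode below
    newfile = open(fullDestination, "x", encoding='latin1', newline='')
    for line in fileinput.input(fullSource, mode='rb'):  # explicit binary mode
        line = line.decode('latin1')   # convert bytes to str in Python 3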
        matchObj = re.search('(?<!\r)\n', line)
        if matchObj:
            newfile.write(line.rstrip('\r\n'))
        else:
            newfile.write(line)
    newfile.close()
    print ("created " + fullDestination)
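
As an alternative sketch, not the approach above: if the files fit comfortably in memory, you could skip the line loop and the decoding altogether and apply the same lookbehind to the raw bytes. strip_lone_lf is a hypothetical helper name, and this assumes the goal is to delete standalone LFs while leaving CRLF pairs untouched, as the script in the question attempts.

```python
import re

# Bytes pattern: an LF that is not immediately preceded by a CR.
LONE_LF = re.compile(rb'(?<!\r)\n')


def strip_lone_lf(data: bytes) -> bytes:
    """Delete standalone LF bytes; CRLF pairs are left untouched."""
    return LONE_LF.sub(b'', data)


sample = b"one\ntwo\r\nthree\n"
print(strip_lone_lf(sample))  # b'onetwo\r\nthree'
```

For a real file you would read it with open(path, 'rb') and write the result with open(dest, 'xb'), so no newline translation happens in either direction.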

Answer 2

Your regex correctly matches \n characters that are not preceded by \r:

>>> re.search('(?<!\r)\n', 'abc\r')
>>> re.search('(?<!\r)\n', 'abc\r\n')
>>> re.search('(?<!\r)\n', 'abc\n')
<_sre.SRE_Match object; span=(3, 4), match='\n'>
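
If the aim is only to detect which kinds of line ending a string contains, the same pattern can be used to count them. A minimal sketch, with count_line_endings as a made-up helper name, assuming the text was read without newline translation (binary mode or newline=''):

```python
import re


def count_line_endings(text):
    """Return (standalone LF count, CRLF count) for the given string."""
    lone_lf = len(re.findall(r'(?<!\r)\n', text))  # LF not preceded by CR
    crlf = text.count('\r\n')                      # CR immediately followed by LF
    return lone_lf, crlf


print(count_line_endings('a\nb\r\nc\n\n'))  # (3, 1)
```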

Your if and write are wrong:

if matchObj:  # "If line ends with '\n'"
    # rstrip('\r\n') strips every trailing '\r' and '\n' character, so the '\n' is removed here.
    newfile.write(line.rstrip('\r\n'))
else:
    newfile.write(line)

You probably want to do something like this:

if not matchObj:  # "If line ends with '\r\n'"
    # Note that strip('\r\n') removes these two characters, but does not add '\n' back.
    newfile.write(line.replace('\r\n', '\n'))
else:
    newfile.write(line)

By the way, you don't need a regex to do what you want; endswith() is enough:

if line.endswith('\r\n'):
    newfile.write(line.replace('\r\n', '\n'))
else:
    newfile.write(line)

In fact, replace() alone is enough:

newfile.write(line.replace('\r\n', '\n'))
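
A quick sanity check of that claim, as a sketch: the two helper names below are invented for illustration, and the samples assume each line carries at most one terminator, as lines produced by iterating over a file do.

```python
def convert_with_endswith(line):
    # Variant that checks the line ending first.
    if line.endswith('\r\n'):
        return line.replace('\r\n', '\n')
    return line


def convert_with_replace(line):
    # Variant that relies on replace() being a no-op when '\r\n' is absent.
    return line.replace('\r\n', '\n')


samples = ['abc\r\n', 'abc\n', 'abc', '']
for s in samples:
    assert convert_with_endswith(s) == convert_with_replace(s)

print([convert_with_replace(s) for s in samples])  # ['abc\n', 'abc\n', 'abc', '']
```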

Detecting newline characters with Python
