Question

The text file (file.txt) looks like this:

First line.
2. Second line 
03 Third line
04. Fourth line
5. Line. 
6 Line

The desired output should 1) remove the number at the start of each line and 2) remove the punctuation that follows it:

First line.
Second line
Third line
Fourth line
Line.
Line

I tried:

import re
file=open("file.txt").read().split()
print([i for i in file if re.sub("[0-9]\.*", "", i)])

But I only get results at the word level, not at the line level:

['First', 'line.', 'Second', 'line', 'Third', 'line', 'Fourth', 'line', 'Line.', 'Line']

Answer 1

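Don't use the re module inside a for loop. Regular expressions offer many possibilities, and the re module can also work in multiline mode. For example: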

>>> with open('/tmp/file.txt', 'r') as f:
...     s = f.read()
>>> # or use direct value to test in the Python console:
>>> s = """First line.
... 2. Second line
... 03 Third line
... 04. Fourth line
... 5. Line.
... 6 Line"""

>>> s
'First line.\n2. Second line \n03 Third line\n04. Fourth line\n5. Line. \n6 Line'

>>> import re

>>> re.sub(r'[0-9\.\s]*(.*)', r'\1\n', s, flags=re.M)
'First line.\nSecond line \nThird line\nFourth line\nLine. \nLine\n'

>>> re.sub(r'^[0-9\.\s]*(.*)', r'\1', s, flags=re.M)
'First line.\nSecond line \nThird line\nFourth line\nLine. \nLine'
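
As a minimal sketch of what flags=re.M changes, the snippet below runs one anchored pattern with and without the flag. The helper name strip_prefixes and the use of [ \t] in place of \s are assumptions made for this illustration, not part of the answer above. [ \t] is used so that the pattern cannot consume a newline and merge two lines.

```python
import re

TEXT = "First line.\n2. Second line\n03 Third line\n04. Fourth line\n5. Line.\n6 Line"

# Leading digits, an optional dot, then spaces or tabs (never a newline).
PREFIX = r"^[0-9]+\.?[ \t]*"


def strip_prefixes(text, multiline=True):
    """Remove a leading number from every line (hypothetical helper)."""
    flags = re.MULTILINE if multiline else 0
    return re.sub(PREFIX, "", text, flags=flags)


# With re.MULTILINE, ^ matches after every newline, so each line is cleaned.
cleaned = strip_prefixes(TEXT)

# Without the flag, ^ matches only at the very start of the string.
# TEXT starts with "First", so nothing is removed.
unchanged = strip_prefixes(TEXT, multiline=False)

print(cleaned)
```

Without the flag the call silently returns the input unchanged, which is an easy mistake to miss when testing on a string whose first line has no number.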

Answer 2

You can fix your current code with

import re

with open("file.txt") as f:
    for line in f:
        # Drop the trailing newline, then strip the leading number prefix.
        print(re.sub(r"^[0-9]+\.?\s*", "", line.rstrip("\n")))

See the Python demo.

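You need to open the file and read it line by line. The ^[0-9]+\.?\s* pattern then matches 1 or more digits ([0-9]+), followed by an optional . (\.?), followed by 0 or more whitespace characters (\s*) at the start of each line, and the match is removed if one is found.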
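
The line-by-line approach can be sketched as a small generator. The name clean_lines is hypothetical, and io.StringIO stands in for the real file so that the snippet runs without file.txt being present. Compiling the pattern once is a design choice for this sketch, so the loop reuses one pattern object.

```python
import io
import re

# Same pattern as in the answer, compiled once so the loop reuses it.
LEADING_NUMBER = re.compile(r"^[0-9]+\.?\s*")


def clean_lines(lines):
    """Yield each line without its trailing newline and leading number."""
    for line in lines:
        yield LEADING_NUMBER.sub("", line.rstrip("\n"))


# io.StringIO stands in for open("file.txt") so the sketch runs anywhere.
sample = io.StringIO("First line.\n2. Second line \n03 Third line\n")
result = list(clean_lines(sample))
print(result)
```

Note that trailing spaces (as in "Second line ") are kept. If they are unwanted, line.rstrip() instead of line.rstrip("\n") would remove them as well.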

Answer 3

The split in this line

file=open("file.txt").read().split()

splits the file on whitespace. Use

file=open("file.txt").read().split("\n")

instead to split the file by line.
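
A short sketch of how splitting by line feeds into a line-level list. It assumes the file content ends with a newline, which is common but not shown in the question. In that case split("\n") leaves an empty string at the end, while str.splitlines() does not. The comprehension also keeps the result of re.sub, whereas the original attempt only used it as a filter condition.

```python
import re

content = "First line.\n2. Second line\n6 Line\n"  # note the final newline

by_newline = content.split("\n")  # keeps an empty string after the last "\n"
by_lines = content.splitlines()   # no trailing empty string

# Apply the substitution to each line and keep its result.
cleaned = [re.sub(r"^[0-9]+\.?\s*", "", line) for line in by_lines]
print(cleaned)
```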

Answer 4

Another option is:

import re
f = """First line.
2. Second line
03 Third line
04. Fourth line
5. Line.
6 Line"""
print(re.sub(r"(\d{1,2}\.{,1}\s)", "", f))

It returns:

First line.
Second line
Third line
Fourth line
Line.
Line

It doesn't have to iterate over each line.
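
One caveat worth illustrating: the pattern above is not anchored, so it also removes one- or two-digit numbers that appear in the middle of a line. The sample text below is made up for this illustration, and the anchored variant is only one possible adjustment, not the answer's own code.

```python
import re

# Made-up sample with numbers inside the lines as well as at the start.
text = "1. Chapter 12 begins\n2. See page 7 now"

# Pattern from the answer: matches a number anywhere in the text.
unanchored = re.sub(r"\d{1,2}\.{,1}\s", "", text)

# Anchored variant: ^ plus re.MULTILINE limits matches to line starts.
anchored = re.sub(r"^\d{1,2}\.?[ \t]+", "", text, flags=re.MULTILINE)

print(unanchored)
print(anchored)
```

For the sample in the question both versions give the same result, because no line contains a number after its first word.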

Remove a substring from each line of a text file using a regular expression