Question

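I have about 2,000 text files containing news summaries, and I want to use Python to remove the title from every file that has one (some files have no title for some reason).

Here is an example: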

Ad sales boost Time Warner profit 

Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters.However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues.Time Warner's fourth quarter profits were slightly better than analysts' expectations.For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn.For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins.

My question is: how do I remove the line "Ad sales boost Time Warner profit"?

Edit: I basically want to delete everything before the blank line.

TIA。

Answer 1

If (as you say) it is just a matter of deleting the first line whenever it is followed by \n\n, you can use a simple regular expression like this:

import re

with open('testing.txt', 'r') as fin:
    doc = fin.read()

doc = re.sub(r'^.+?\n\n', '', doc)
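
To see how that pattern behaves on files with and without a title, here is a minimal sketch that applies it to in-memory strings. The helper name `strip_title` and the shortened sample strings are illustrative assumptions, not part of the original answer.

```python
import re

def strip_title(doc: str) -> str:
    # '^' anchors at the start of the text (no re.MULTILINE flag), and '.'
    # does not match a newline, so only a one-line title that is directly
    # followed by a blank line can match.
    return re.sub(r'^.+?\n\n', '', doc)

# Shortened sample strings (assumed for illustration)
with_title = "Ad sales boost Time Warner profit \n\nQuarterly profits jumped 76%.\n"
without_title = "Quarterly profits jumped 76%.\n"

print(repr(strip_title(with_title)))     # title and blank line removed
print(repr(strip_title(without_title)))  # no blank line, so text is unchanged
```

Note that the snippet in the answer only changes `doc` in memory; the result still has to be written back to the file.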

Answer 2

This removes everything up to and including the first blank line ('\n\n').

with open('text.txt', 'r') as file:
    f = file.read()

idx = f.find('\n\n') # Look for the first blank line
if idx != -1:        # find() returns -1 when nothing is found
    g = f[idx+2:]    # Keep everything after the blank line
else:                # No blank line: keep the original text
    g = f

print(g)

# Save the file
with open('text.txt', 'w') as file:
    file.write(g)

"Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters.However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues.Time Warner's fourth quarter profits were slightly better than analysts' expectations.For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn.For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins.\n"

Answer 3

Try this: it splits the text at the first blank line ('\n\n') and keeps only the last element (the body).

line.split('\n\n', 1)[-1]

It also works when the text contains no blank line.
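
A small sketch of the split approach on three cases may help. The function name `body_of` and the sample strings are assumptions made for illustration. The third case shows a caveat: an untitled file whose body has several paragraphs would lose its first paragraph.

```python
def body_of(text: str) -> str:
    # maxsplit=1 cuts only at the first blank line; [-1] takes the part
    # after it, or the whole text when there is no blank line at all.
    return text.split('\n\n', 1)[-1]

# Sample strings (assumed for illustration)
titled = "Ad sales boost Time Warner profit\n\nQuarterly profits jumped 76%.\n"
untitled = "Quarterly profits jumped 76%.\n"
two_paragraphs = "First paragraph.\n\nSecond paragraph.\n"

print(repr(body_of(titled)))          # body only
print(repr(body_of(untitled)))        # unchanged
print(repr(body_of(two_paragraphs)))  # first paragraph is dropped
```

If some untitled files contain more than one paragraph, an extra check (for example, on the length of the first line) would probably be needed before stripping.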

Answer 4

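As you may know, you can't read from and write to a file at the same time. So the solution in this case is to read the lines into a variable, modify them, and write them back to the file.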

lines = []

# open the text file in read mode and readlines (returns a list of lines)
with open('textfile.txt', 'r') as file:
    lines = file.readlines()

# open the text file in write mode and write lines
with open('textfile.txt', 'w') as file:
    # if the number of lines is bigger than 1 (assumption) write summary else write all lines
    file.writelines(lines[2:] if len(lines) > 1 else lines)

The above is a simple example of how you could achieve your goal. Keep in mind, though, that some edge cases may come up.
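
Since the question mentions about 2,000 files, the same read-modify-write idea could be applied to a whole folder. The following is only a sketch under stated assumptions: the function name `strip_titles_in_dir`, the `*.txt` pattern, and the UTF-8 encoding are all hypothetical and would need to match the real files.

```python
from pathlib import Path

def strip_titles_in_dir(folder, pattern="*.txt", encoding="utf-8"):
    """Remove everything up to the first blank line in each matching file.

    Returns the number of files that were rewritten. Files without a
    blank line are left untouched. Pattern and encoding are assumptions.
    """
    changed = 0
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding=encoding)
        # partition() returns (before, separator, after); the separator
        # is empty when no blank line exists in the text.
        _title, sep, body = text.partition("\n\n")
        if sep:
            path.write_text(body, encoding=encoding)
            changed += 1
    return changed
```

A call such as `strip_titles_in_dir("summaries")`, where "summaries" is a placeholder folder name, would rewrite the files in place, so it seems wise to run it on a copy of the data first.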

How do I remove the title from a text file in Python?
