Question

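I have a long text file converted from a PDF, and I want to remove certain kinds of content from it, for example page numbers that sit on a line by themselves but may be surrounded by spaces. I wrote a regex that works on short strings, for example: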

import re

news1 = 'Hello done.\n4\nNext paragraph.'
m = re.sub('\n *[0-9] *\n', ' ', news1)
print(m)
Hello done. Next paragraph.

But when I try this on a more complex string, it fails, for example:

news = '1   \n  Hello done. \n 4 \n  44 \n  Next paragraph.'
m = re.sub('\n *[0-9] *\n', ' ', news)
print(m)
1   
  Hello done.    44 
  Next paragraph.

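How can I make this work across the whole file? Should I read the file line by line and process each line, instead of trying to edit the whole string?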

I also tried using periods to match any character, but that does not catch the initial '1' in the more complex string. So I suppose I could use 2 regexes.

m = re.sub('. *[0-9] *.', '', news)
print(m)
1   
  Hello done. 


  Next paragraph.

Thoughts?

Answer 1

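I would suggest going line by line, unless you have some specific reason to read it all in as one string. Then you only need a few regexes to clean each line up: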

import re

# Assumes `text` holds a single line of the file.
# Not sure how the pages are numbered, but perhaps...
text = re.sub(r"^\s*\d+\s*$", "", text)

# Chuck in a line to strip out runs of at least 3 capital letters
text = re.sub(r"[A-Z]{3,}", "", text)

# Collapse runs of whitespace into 1 space, handy for cleaning up the data
text = re.sub(r"\s+", " ", text)

# Trim the start and end of the line
text = text.strip()
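
As a rough illustration of the line-by-line route, here is a minimal sketch applied to the sample string from the question. The function name `drop_page_numbers` is hypothetical, and the sketch assumes that a page number is any line holding only digits and optional whitespace, which may not match how your PDF numbers its pages.

```python
import re

# Assumption: a "page number" is a line holding only digits and whitespace.
PAGE_NUMBER = re.compile(r"\s*\d+\s*")

def drop_page_numbers(text):
    """Return text with digit-only lines dropped and the rest tidied up."""
    kept = []
    for line in text.splitlines():
        if PAGE_NUMBER.fullmatch(line):
            continue  # skip lines such as ' 4 ' or '  44 '
        # Collapse whitespace runs and trim, as in the snippet above
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            kept.append(line)
    return " ".join(kept)

news = '1   \n  Hello done. \n 4 \n  44 \n  Next paragraph.'
print(drop_page_numbers(news))
```

Joining with a space mirrors the replacement used in the question; joining with "\n" instead would keep the original line breaks. Because each line is tested on its own, consecutive number lines and a number on the very first line need no special handling.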

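This is just one strategy, but it is the route I would take. It stays easy to maintain when the business side comes up with "OH OH! Can you also replace any mention of 'Cat' with 'Dog'?", and I think it makes your changes easier to modify and document. Maybe even try using re.subn to keep track of the changes...?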
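
To make the re.subn idea concrete, here is one possible sketch. The `RULES` list and the `clean` function are hypothetical names, not part of any library. This variant works on the whole string and relies on re.MULTILINE, so that `^` and `$` anchor at every line. Unlike the pattern in the question, it does not need to consume a newline on both sides of the number, so back-to-back number lines and a number on the first line all match.

```python
import re

# Hypothetical rule list: (description, pattern, replacement, flags).
RULES = [
    # A line holding only digits and blanks, plus its trailing newline
    ("page numbers", r"^[ \t]*\d+[ \t]*$\n?", "", re.MULTILINE),
    # Any run of whitespace, including newlines, becomes one space
    ("whitespace runs", r"\s+", " ", 0),
]

def clean(text):
    """Apply each rule in order and record how many substitutions it made."""
    counts = {}
    for name, pattern, repl, flags in RULES:
        # re.subn returns (new_string, number_of_substitutions)
        text, n = re.subn(pattern, repl, text, flags=flags)
        counts[name] = n
    return text.strip(), counts

news = '1   \n  Hello done. \n 4 \n  44 \n  Next paragraph.'
cleaned, counts = clean(news)
print(counts)
print(cleaned)
```

A new request such as the 'Cat' to 'Dog' replacement would then be one more entry in the list, and the per-rule counts give a simple record of what each rule changed.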

Python3 and regex: how to remove lines of numbers?