Question

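I have a large text file (a book), and I'm trying to strip the punctuation, special characters, and whitespace from the whole file so that I can build a dictionary of all the words. For some reason, when I use the .strip() method, it does almost nothing.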

import string

with open(filename, 'r') as file:
    entire = file.read()
    entire = entire.lower()  # lower-case the entire text (this works)
    entire = entire.strip(string.punctuation + string.digits)  # this, however, does nothing

How can I remove the punctuation and digits from the whole book so that I can build the dictionary?

Answer 1

You can use str.translate() to delete characters:

import string

table = {ord(k) : None for k in string.punctuation + string.digits}
with open(filename, 'r') as f:
    entire = f.read().lower() #lower case the entire text (this works)
    entire = entire.translate(table)

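table specifies the characters to delete by mapping each character's code point (ord(k)) to None. A dictionary comprehension is used to construct table; the call to str.translate() then performs the deletion.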
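
To get from the cleaned text to the dictionary of words the question asks for, one possible sketch follows. It assumes whitespace-separated words are good enough and uses collections.Counter for the counts; the helper name count_words and the sample string are made up for illustration and are not part of the original answer.

```python
import string
from collections import Counter

# Map every punctuation character and digit to None so translate() drops it.
TABLE = {ord(ch): None for ch in string.punctuation + string.digits}

def count_words(text):
    """Return a Counter mapping each lower-cased word to its frequency."""
    cleaned = text.lower().translate(TABLE)
    # split() with no arguments splits on any run of whitespace,
    # so the gaps left behind by deleted characters do no harm.
    return Counter(cleaned.split())

# Hypothetical sample text standing in for the contents of the book.
sample = "The cat, the hat; 2 cats!"
counts = count_words(sample)
print(counts["the"])  # 2
```

For the real file, the string returned by f.read() would be passed to count_words in place of sample.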

Answer 2

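str.strip only removes characters from the two ends of the string. It stops at the first character that is not in the given set and never touches anything in between. For example: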

>>> 'abcXYZabcXYZbca'.strip('abc')
'XYZabcXYZ'
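
The same thing happens with the call in the question. Here is a minimal sketch, using made-up sample strings, of how strip() trims only the run of matching characters at each end and leaves interior punctuation and digits alone:

```python
import string

chars = string.punctuation + string.digits

# Hypothetical sample line; punctuation and digits sit at both ends and in the middle.
sample = "...chapter 1: it was a dark, stormy night!!!42"
stripped = sample.strip(chars)
print(stripped)  # chapter 1: it was a dark, stormy night

# Text read from a file usually ends with a newline, which is not in chars,
# so nothing at all is removed from the right-hand end.
line = "Call me Ishmael, 1851.\n"
print(line.strip(chars) == line)  # True
```

A whole book read with file.read() typically starts with a letter and ends with a newline, which would explain why strip() appears to do nothing.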

You can build a translation table and use str.translate instead:

>>> import string
>>> table = str.maketrans({c: None for c in string.punctuation + string.digits})
>>> "Foo bar's baz, 123 abc".translate(table)
'Foo bars baz  abc'

Python: Why doesn't .strip() work on the whole file?

2 answers: