Question

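This question (Best way to strip punctuation from a string in Python) deals with stripping punctuation from a single string. However, I want to read text from an input file and print only ONE COPY of each string, without trailing punctuation. I have started with something like this: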

f = open('#file name ...', 'a+')
for x in set(f.read().split()):
    print x

But the problem is that if the input file has this line:

This is not is, clearly is: weird

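it treats the three occurrences of "is" as different strings ("is", "is," and "is:"), but I would like to ignore any punctuation and have it print "is" only once rather than three times. How do I remove any kind of trailing punctuation and then put the resulting strings into the set?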

Thanks for any help. (I am really new to Python.)

Answer 1

import re

# f is the file object opened in the question
for x in set(re.findall(r'\b\w+\b', f.read())):
    print(x)

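This should separate the words more accurately.

The regular expression finds contiguous runs of word characters (a-z, A-Z, 0-9, _).

If you only want letters (no digits and no underscore), replace \w with [a-zA-Z].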
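To see what that substitution changes, here is a small comparison. It is only a sketch: it assumes Python 3, and the sample string is made up for illustration. Note that with the \b anchors kept, a token that mixes letters with digits or underscores is skipped entirely rather than trimmed down to its letters.

```python
import re

# Hypothetical sample text, not taken from the question.
sample = "item_1 costs 20 dollars, item2 costs 30"

# \w+ accepts letters, digits and the underscore.
word_chars = re.findall(r'\b\w+\b', sample)

# [a-zA-Z]+ between word boundaries only accepts tokens made purely of ASCII letters.
letters_only = re.findall(r'\b[a-zA-Z]+\b', sample)

print(word_chars)
print(letters_only)
```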

>>> re.findall(r'\b\w+\b', "This is not is, clearly is: weird")
['This', 'is', 'not', 'is', 'clearly', 'is', 'weird']
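Wrapped up as a function, the idea might look roughly like this. The sketch assumes Python 3; the name unique_words is made up here, and io.StringIO only stands in for the real input file so the snippet runs on its own.

```python
import io
import re

def unique_words(stream):
    """Return the set of distinct words found in a readable text stream."""
    # Hypothetical helper: findall collects every run of word characters,
    # and set() collapses the duplicates.
    return set(re.findall(r'\b\w+\b', stream.read()))

# Stand-in for the input file from the question.
fake_file = io.StringIO("This is not is, clearly is: weird\n")

for word in sorted(unique_words(fake_file)):
    print(word)
```

With a real file, the same function could be called on an object opened for reading, for example inside a with open(...) block.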

Answer 2

If you don't mind replacing the punctuation characters with whitespace, you can use a translation table, e.g.:

>>> from string import maketrans
>>> punctuation = ",;.:"
>>> replacement = "    "
>>> trans_table = maketrans(punctuation, replacement)
>>> 'This is not is, clearly is: weird'.translate(trans_table)
'This is not is  clearly is  weird'
# And for your case of creating a set of unique words.
>>> set('This is not is  clearly is  weird'.split())
set(['This', 'not', 'is', 'clearly', 'weird'])
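The session above is Python 2, where maketrans lives in the string module. On Python 3 the table is built with str.maketrans instead. The following is a sketch under that assumption; the function name unique_words_translate is made up, and using string.punctuation as the default is a choice made here rather than something the answer specifies.

```python
import string

def unique_words_translate(text, punctuation=string.punctuation):
    """Replace each punctuation character with a space, then collect unique words."""
    # str.maketrans needs both arguments to be the same length,
    # so pair every punctuation character with one space.
    table = str.maketrans(punctuation, " " * len(punctuation))
    return set(text.translate(table).split())

words = unique_words_translate("This is not is, clearly is: weird")
print(sorted(words))
```

Because the apostrophe is part of string.punctuation, a word such as "don't" would come out as "don" and "t"; passing a narrower punctuation string such as ",;.:" avoids that.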

Removing punctuation from unique strings in an input file
