Question

This is my string:

mystring = "How’s it going?"

This is what I did:

import string
exclude = set(string.punctuation)

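def strip_punctuations(mystring):
    # Drop every ASCII punctuation character listed in string.punctuation
    new_string = ''.join(ch for ch in mystring if ch not in exclude)
    # Manually remove the UTF-8 bytes of the curly apostrophe (U+2019)
    new_string = new_string.replace("\xe2\x80\x99", "")
    # Manually remove a pair of UTF-8 encoded non-breaking spaces (U+00A0)
    new_string = new_string.replace("\xc2\xa0\xc2\xa0", "")
    return new_string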

Output:

If I leave out the line that replaces "\xe2\x80\x99" with an empty string, this is the output:

 'How\xe2\x80\x99s it going'

I realized that exclude does not contain that odd apostrophe:

print set(exclude)
set(['!', '#', '"', '%', '$', "'", '&', ')', '(', '+', '*', '-', ',', '/', '.', ';', ':', '=', '<', '?', '>', '@', '[', ']', '\\', '_', '^', '`', '{', '}', '|', '~'])

How can I make sure all such characters are removed, without having to replace each one manually in the future?

Answer 1

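If you are working with long texts such as news articles or web-scraped content, you can use the "goose" or "NLTK" Python libraries. Neither comes preinstalled. Here are the links to the libraries: goose, NLTK

You can go through their documentation to see how to do this.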

OR

If you don't want to use these libraries, you may need to build your own "exclude" list by hand.
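
A minimal sketch of such a hand-built exclude list, assuming Python 3, where str is Unicode and the curly apostrophe is the single character U+2019 rather than the three bytes \xe2\x80\x99. The names EXTRA_EXCLUDE, strip_with_exclude, and strip_unicode_punctuation are hypothetical, and the extra characters listed are only examples. The second function is an alternative that avoids maintaining a list by asking the standard unicodedata module whether each character falls in a Unicode punctuation category:

```python
import string
import unicodedata

# Hypothetical extras: curly single/double quotes and the non-breaking space.
EXTRA_EXCLUDE = {"\u2018", "\u2019", "\u201c", "\u201d", "\u00a0"}
EXCLUDE = set(string.punctuation) | EXTRA_EXCLUDE


def strip_with_exclude(text):
    """Remove every character that appears in the hand-built EXCLUDE set."""
    return "".join(ch for ch in text if ch not in EXCLUDE)


def strip_unicode_punctuation(text):
    """Remove every character whose Unicode category starts with 'P' (punctuation)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


print(strip_with_exclude("How\u2019s it going?"))
print(strip_unicode_punctuation("How\u2019s it going?"))
```

Note that the category-based version only covers punctuation: symbols such as $ or + and the non-breaking space belong to other Unicode categories, so the two approaches may need to be combined depending on the data.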

Answer 2

import re

toReplace = "how's it going?"
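# Raw string so backslashes reach the regex engine unchanged; the hyphen is escaped so it is not read as a range
regex = re.compile(r"""[!#%$"&)'(+*\-/.,;:=<?>@\[\]_^`{}|~\\]""")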
newVal = regex.sub('', toReplace)
print(newVal)

The regex matches every character you list in the set and replaces it with an empty string, which removes it.
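
The same idea can be extended to the curly apostrophe from the question. The following is only a sketch, assuming Python 3; EXTRA_CHARS, PUNCT_RE, and strip_with_regex are hypothetical names, and the extra characters are examples to adjust. It builds the character class from string.punctuation with re.escape instead of typing each character by hand:

```python
import re
import string

# Assumed extras: curly quotes and the non-breaking space; adjust to your data.
EXTRA_CHARS = "\u2018\u2019\u201c\u201d\u00a0"

# re.escape makes characters such as ], ^, - and \ safe inside the class.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation + EXTRA_CHARS) + "]")


def strip_with_regex(text):
    """Remove ASCII punctuation plus the extra characters listed above."""
    return PUNCT_RE.sub("", text)


print(strip_with_regex("How\u2019s it going?"))  # prints: Hows it going
```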

Python: removing odd apostrophes and other odd characters that are not in string.punctuation
