Question

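I have a question about string.punctuation.

I'm using NLTK and I need to strip punctuation from a text (the text has already been split into tokens with word_tokenize(my_str)).

I wrote some simple functions to do this, but after calling them I saw that the double-quote tokens are still there! The others, such as commas, full stops and other special characters, are removed correctly, but not the double quotes. Why? If I print string.punctuation in the Python interpreter, I get the list of characters that are considered punctuation, and it includes the double quote: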

 >>> import string
 >>> print string.punctuation
    !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

My functions are:

import string

def is_punct_char(char):
    return char in string.punctuation

def is_not_punct_char(char):
    return not is_punct_char(char)

# clear punct:
# \par:    list of tokens
# \return: list of tokens with the punctuation tokens filtered out
def erase_punct(token_list):
    return filter(is_not_punct_char, token_list)

The original text is:

Hello, how are you? I'm ok, thanks. And you? Not very "well".

After tokenization, the output is:

[u'Hello', u',', u'how', u'are', u'you', u'?', u'I', u"'m", u'ok', u',', u'thanks', u'.', u'And', u'you', u'?', u'Not', u'very', u'``', u'well', u"''", u'.']

After removing punctuation, the output is:

[u'Hello', u'how', u'are', u'you', u'I', u"'m", u'ok', u'thanks', u'And', u'you', u'Not', u'very', u'``', u'well', u"''"]

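This is not correct. As the last token I expected u'well', without the two double-quote tokens around it (u'``' and u"''").

Can anyone help me?

Are double quotes not recognized as punctuation in Python 2.7?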
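
A minimal sketch of one likely explanation, assuming (as the output above suggests) that word_tokenize turns opening and closing double quotes into the two-character tokens `` and '': for strings, `in` is a substring test, so a multi-character token only counts as punctuation if it appears verbatim inside string.punctuation, and neither `` nor '' does. The helper names is_punct_token and erase_punct_tokens are hypothetical and not part of NLTK or the standard library.

```python
import string

# Set of single punctuation characters, for per-character membership tests.
PUNCT = set(string.punctuation)

def is_punct_token(token):
    # True only when the token is non-empty and every character is punctuation.
    return len(token) > 0 and all(ch in PUNCT for ch in token)

def erase_punct_tokens(token_list):
    # Keep tokens that contain at least one non-punctuation character.
    return [tok for tok in token_list if not is_punct_token(tok)]

# Substring tests against the whole punctuation string:
comma_found = "," in string.punctuation         # single character, present
backticks_found = "``" in string.punctuation    # two backticks never occur in a row
apostrophes_found = "''" in string.punctuation  # two apostrophes never occur in a row

tokens = ["Not", "very", "``", "well", "''", ".", "'m"]
cleaned = erase_punct_tokens(tokens)
```

With this per-character check, the quote tokens are dropped while a token such as 'm is kept because it contains a letter.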

0 answers: