Question

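I seem to be having trouble removing punctuation from a string in Python. I'm given a text file (specifically, a book from Project Gutenberg) and a list of stopwords, and I want to return a dictionary of the 10 most frequently used words. Unfortunately, I keep getting a hiccup in the returned dictionary.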

import sys
import collections
from string import punctuation
import operator

#should return a string without punctuation
def strip_punc(s):
    return ''.join(c for c in s if c not in punctuation)

def word_cloud(infile, stopwordsfile):

    wordcount = {}

    #Reads the stopwords into a list
    stopwords = [x.strip() for x in open(stopwordsfile, 'r').readlines()]


    #reads data from the text file into a list
    lines = []
    with open(infile) as f:
        lines = f.readlines()
        lines = [line.split() for line in lines]

    #does the wordcount
    for line in lines:
        for word in line:
            word = strip_punc(word).lower()
            if word not in stopwords:
                if word not in wordcount:
                    wordcount[word] = 1
                else:
                    wordcount[word] += 1

    #sorts the dictionary, grabs 10 most common words
    output = dict(sorted(wordcount.items(),
                  key=operator.itemgetter(1), reverse=True)[:10])

    print(output)


if __name__=='__main__':

    try:

        word_cloud(sys.argv[1], sys.argv[2])

    except Exception as e:

        print('An exception has occurred:')
        print(e)
        print('Try running as python3 word_cloud.py <input-text> <stopwords>')

This prints:

{'said': 659, 'mr': 606, 'one': 418, '“i': 416, 'lorry': 322, 'upon': 288, 'will': 276, 'defarge': 268, 'man': 264, 'little': 263}

The “i shouldn't be there. I don't understand why my helper function isn't removing it.

Thanks in advance.

Answer 1

The character “ is not the same character as ".
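A quick way to confirm this, as a minimal sketch using only the standard library, is to compare the two code points and test each one for membership in string.punctuation. The variable names curly and straight are chosen just for this illustration.

```python
import string
import unicodedata

curly = '\u201c'   # the curly quote left over in the question's output
straight = '"'     # the ASCII quotation mark

# The two characters are different code points with different Unicode names.
print(hex(ord(curly)), unicodedata.name(curly))
print(hex(ord(straight)), unicodedata.name(straight))

# Only the ASCII quotation mark is a member of string.punctuation.
print(curly in string.punctuation)
print(straight in string.punctuation)
```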

string.punctuation contains only the following ASCII characters:

In [1]: import string

In [2]: string.punctuation
Out[2]: '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

So you need to extend the list of characters you are stripping.

The following should do what you need:

extended_punc = punctuation + '“' #  and any other characters you need to strip

def strip_punc(s):
    return ''.join(c for c in s if c not in extended_punc)
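If you would rather not maintain that list by hand, one possible alternative (a sketch, not part of the original answer) is to drop every character whose Unicode general category starts with P, using the standard unicodedata module. The name strip_punc_unicode is made up for this illustration.

```python
import unicodedata

def strip_punc_unicode(s):
    # Hypothetical helper: keep a character only if its Unicode general
    # category is not one of the punctuation categories (Pc, Pd, Ps, Pe, Pi, Pf, Po).
    return ''.join(c for c in s if not unicodedata.category(c).startswith('P'))

print(strip_punc_unicode('\u201cI'))
print(strip_punc_unicode("don't, \u201cstop\u201d!"))
```

Note that this is not a drop-in equivalent of string.punctuation: characters such as $ and + are classed as symbols (categories starting with S), so this check leaves them in place.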

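Alternatively, you can use the unidecode package to transliterate your text to ASCII, so you don't have to build up a list of the Unicode characters you might need to handle: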

from unidecode import unidecode

def strip_punc(s):
    # In Python 3, str is already Unicode, so no decode/encode step is needed.
    s = unidecode(s)
    return ''.join(c for c in s if c not in punctuation)

Answer 2

As the other answer says, the problem is that string.punctuation contains only ASCII characters, so it misses many typographic ("fancy") quotation marks such as “.

You can replace the strip_punc function with the following:

import re

def strip_punc(s):
    '''
    Remove all punctuation characters.
    '''
    return re.sub(r'[^\w\s]', '', s)

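This approach uses the re module. The regular expression works as follows: it matches any character that is neither alphanumeric (\w) nor whitespace (\s) and replaces it with the empty string (that is, deletes it).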

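This solution takes advantage of the fact that the "special sequences" \w and \s are Unicode-aware (for str patterns in Python 3), i.e. they work equally well for characters of any script, not just ASCII: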

>>> strip_punc("I said “naïve”, didn't I!")
'I said naïve didnt I'

Note that \w includes the underscore (_), because it is considered "alphanumeric". If you want to strip it as well, change the pattern to:

r'[^\w\s]|_'
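As a small sketch of the difference between the two patterns (the function names and the sample string below are invented for the comparison):

```python
import re

def strip_punc_keep_underscore(s):
    # \w matches the underscore, so it survives this substitution.
    return re.sub(r'[^\w\s]', '', s)

def strip_punc_drop_underscore(s):
    # The extra alternative "|_" removes underscores as well.
    return re.sub(r'[^\w\s]|_', '', s)

sample = 'snake_case \u201cquoted\u201d!'
print(strip_punc_keep_underscore(sample))   # snake_case quoted
print(strip_punc_drop_underscore(sample))   # snakecase quoted
```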

Answer 3

I would change the logic in the strip_punc function:

from string import ascii_letters

def strip_punc(word):
    return ''.join(c for c in word if c in ascii_letters)

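This logic is an explicit allow rather than an explicit deny: you permit only the values you want instead of blocking only the values you know you don't want, so any edge cases you haven't thought of are left out.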
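One trade-off of the allow-list is worth noting, shown here as a hedged sketch: ascii_letters also discards digits and accented letters. If that is too strict for your text, str.isalpha() is one possible Unicode-aware allow-list. The name strip_punc_alpha is made up for this example.

```python
from string import ascii_letters

def strip_punc(word):
    # Allow-list from the answer: keep ASCII letters only.
    return ''.join(c for c in word if c in ascii_letters)

def strip_punc_alpha(word):
    # Hypothetical variant: keep anything Python considers a letter,
    # including accented and non-Latin letters.
    return ''.join(c for c in word if c.isalpha())

word = '\u201cna\u00efve\u201d'    # “naïve” with curly quotes
print(strip_punc(word))        # the ï is dropped along with the quotes
print(strip_punc_alpha(word))  # the ï is kept, the quotes are dropped
```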

Also take a look at this: Best way to strip punctuation from a string in Python

Answer 4

Without knowing what is in the stopword list, the quickest solution is to add:

#Reads the stopwords into a list
stopwords = [x.strip() for x in open(stopwordsfile, 'r').readlines()]
stopwords.append('“i')

Then continue with the rest of the code.

Removing punctuation from a Python string