Question

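The code below manages to separate out the stop words in the inaugural txt files, but my only problem is that I also need to remove the punctuation from the list. Any help on how to do that would be appreciated.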

import nltk
from collections import Counter

def content_text(inaugural):
    # Count stop words and all other tokens separately
    stopwords = set(nltk.corpus.stopwords.words('english'))
    w_stp = Counter()
    wo_stp = Counter()
    for word in inaugural:
        word = word.lower()
        if word in stopwords:
            w_stp.update([word])
        else:
            wo_stp.update([word])
    # Ten most common stop words, then ten most common other tokens
    return [k for k, _ in w_stp.most_common(10)], [y for y, _ in wo_stp.most_common(10)]

print(content_text(nltk.corpus.inaugural.words('1861-Lincoln.txt')))
print(content_text(nltk.corpus.inaugural.words('1941-Roosevelt.txt')))
print(content_text(nltk.corpus.inaugural.words('1945-Roosevelt.txt')))
print(content_text(nltk.corpus.inaugural.words('1981-Reagan.txt')))
print(content_text(nltk.corpus.inaugural.words('1985-Reagan.txt')))

Answer 1

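A good way to do this is with a regular expression:

import re
import nltk

# re.sub needs a string, so read the speech with raw() rather than words();
# replace the file ID with the speech you want
text = re.sub('[^A-Za-z0-9]+', ' ', nltk.corpus.inaugural.raw('1861-Lincoln.txt'))

This replaces every run of characters that are not ASCII letters or digits with a single space.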
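
As a rough sketch of what that substitution does, the snippet below applies the same pattern to a short sample string instead of the NLTK corpus, so it runs without any downloaded data. The sample sentence and the helper name `strip_non_alnum` are made up for illustration.

```python
import re

# Same pattern as in the answer, compiled once.
NON_ALNUM = re.compile(r'[^A-Za-z0-9]+')

def strip_non_alnum(text):
    """Replace each run of non-alphanumeric characters with one space.

    Hypothetical helper, not part of NLTK or the original answer.
    """
    return NON_ALNUM.sub(' ', text).strip()

sample = "Fellow-citizens of the United States: in compliance with a custom..."
cleaned = strip_non_alnum(sample)
print(cleaned)          # punctuation replaced by spaces
print(cleaned.split())  # back to a list of tokens
```

Note that this pattern also splits on apostrophes and hyphens and drops accented letters, so "don't" becomes "don t". Whether that is acceptable depends on the text.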

Answer 2

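How about:

for word in inaugural:
    # Lower-case the token and delete the punctuation characters it contains
    word = word.lower().replace(',', '').replace(';', '').replace('.', '')
    # Tokens that were pure punctuation are now empty, so skip them
    if len(word.strip()) > 0:
        if word in stopwords:
            w_stp.update([word])
        else:
            wo_stp.update([word])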

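Add more punctuation characters as needed.

Explanation:

While processing each word, check whether it contains punctuation and, if it does, remove it. Next, check whether the whole word was punctuation. If it was, its length is now 0 and it needs no further processing. Otherwise, process the rest of the word as before.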
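
The steps above can be sketched as a complete function. This is only one possible reading: it uses `str.strip(string.punctuation)` in place of the chained `replace` calls, so it trims punctuation at the ends of a token and leaves internal hyphens or apostrophes alone. The name `count_words` and the tiny `STOPWORDS` set are assumptions made so the example runs without NLTK data; in the question the set would come from `nltk.corpus.stopwords.words('english')`.

```python
import string
from collections import Counter

# Tiny stand-in stop word list (an assumption, not the NLTK list).
STOPWORDS = {"the", "of", "and", "to", "in", "a"}

def count_words(tokens, stopwords=STOPWORDS, n=10):
    """Return the n most common stop words and non-stop words, skipping punctuation."""
    w_stp, wo_stp = Counter(), Counter()
    for token in tokens:
        # Trim punctuation from both ends; a pure-punctuation token becomes ''.
        word = token.lower().strip(string.punctuation)
        if not word:
            continue
        if word in stopwords:
            w_stp[word] += 1
        else:
            wo_stp[word] += 1
    return ([w for w, _ in w_stp.most_common(n)],
            [w for w, _ in wo_stp.most_common(n)])

tokens = ["The", "Union", ",", "the", "Constitution", ";", "and", "the", "Union", "."]
print(count_words(tokens))
# (['the', 'and'], ['union', 'constitution'])
```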

Original suggestion

def content_text(inaugural):
    inaugural = inaugural.replace(',', '').replace(';', '').replace('.', '')
    (... the rest of the method...)

This was wrong because inaugural is not a string. @Sam spotted the error.

Answer 3

You can use Python's maketrans and translate functions, like this:

import string

def remove_punctuation(s):
    return s.translate(str.maketrans(string.punctuation, " " * len(string.punctuation))).replace("  ", " ")

print(remove_punctuation("Test@this!!out"))

This prints the following:

Test this out
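
One caveat with the function above: `.replace("  ", " ")` makes a single pass, so three or more punctuation characters in a row still leave a double space. A possible variant, sketched here under the hypothetical name `remove_punctuation_collapsed`, splits on whitespace and rejoins instead:

```python
import string

# Map every ASCII punctuation character to a space.
_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def remove_punctuation_collapsed(s):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    # split() with no argument drops empty pieces, so any run of spaces
    # (and leading or trailing whitespace) disappears when rejoining.
    return " ".join(s.translate(_TABLE).split())

print(remove_punctuation_collapsed("Test@this!!!out"))
# Test this out
```

Unlike the original, this version also strips leading and trailing whitespace, which may or may not be what you want.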

How do I remove punctuation from a txt (inaugural)?
