Python 2.7: removing specific punctuation & stop words

Time: 2016-11-25 12:55:55

Tags: python python-2.7 punctuation

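I am trying to do 3 things in my code:

  • Remove specific punctuation marks
  • Convert the input to lowercase
  • Remove stop words

How can I remove the punctuation without using the join function? I'm new to Python, and I haven't managed to remove the stop words in a similar way either...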

import string
s = raw_input("Search: ")    #user input
stopWords = [ "a", "i", "it", "am", "at", "on", "in", "to", "too", "very", \
          "of", "from", "here", "even", "the", "but", "and", "is", "my", \
          "them", "then", "this", "that", "than", "though", "so", "are" ]

PunctuationToRemove = [".", ",", ":", ";", "!" ,"?", "&"]

while s != "":
    s1 = ""

#Deleting punctuations and applying lowercase
    for c in s:                             #for characters in user's input
        if c not in PunctuationToRemove + " ": #characters that don't include punctuations and blanks
            s1 = s + c                      #store the above result to s1
            s1 = string.lower(s)            #then change s1 to lowercase
    print s1

3 Answers:

Answer 0 (score: 0)

To get rid of all the stop words, you can do:

[word for word in myString.split(" ") if word not in stopWords]
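
A minimal sketch of how this comprehension might be used end to end. The sample sentence, the shortened stop-word list, and the helper name remove_stop_words are assumptions made for illustration, not part of the answer. The input is lowercased first so that "The" matches "the", and split() is called without arguments so that repeated spaces don't produce empty words.

```python
# Hypothetical helper wrapping the list comprehension above.
stopWords = ["a", "i", "am", "the", "is", "my", "on", "in"]  # shortened list, for illustration only

def remove_stop_words(myString, stop_words):
    # Lowercase first so capitalised words such as "The" are matched too.
    return [word for word in myString.lower().split() if word not in stop_words]

result = remove_stop_words("The cat is on my desk", stopWords)
print(result)  # ['cat', 'desk']
```

The result is a list of words; whether to keep it as a list or rebuild a string from it depends on what the search code needs next.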

Answer 1 (score: 0)

I suggest getting rid of all the punctuation first. This can be done with a for loop:

for forbiddenChar in PunctuationToRemove:
    s = s.replace(forbiddenChar,"")        #Replace forbidden chars with empty string

Then you can split the input string s into words with s.split(' '). After that, you can use a for loop to add all the words (in lowercase) to a new string s1:

words = s.split(' ')
s1 = ""
for word in words:
    word = word.lower()                  #Lowercase first so "The" matches "the"
    if word and word not in stopWords:   #Skip empty strings left by repeated spaces
        s1 = s1 + word + " "

s1 = s1.rstrip(" ")         #Strip trailing space
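
The two loops above could be combined into one function, sketched here under a few assumptions: the name clean_query, the sample sentence, and the shortened stop-word list are hypothetical. No join() is used, as the question asks.

```python
# Hypothetical helper combining the two loops above; no join() is used.
def clean_query(s, punctuation, stop_words):
    # Step 1: delete every character listed in `punctuation`.
    for forbidden_char in punctuation:
        s = s.replace(forbidden_char, "")
    # Step 2: lowercase, split on whitespace, and keep only non-stop-words.
    s1 = ""
    for word in s.lower().split():
        if word not in stop_words:
            s1 = s1 + word + " "
    return s1.rstrip(" ")  # strip the trailing space added by the loop

PunctuationToRemove = [".", ",", ":", ";", "!", "?", "&"]
stopWords = ["a", "i", "am", "the", "is", "to"]  # shortened list, for illustration only

print(clean_query("I am going to the Market, today!", PunctuationToRemove, stopWords))
# going market today
```

Building the string with repeated concatenation is fine for short search queries; for long inputs, collecting the words in a list would likely be faster.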

Answer 2 (score: 0)

How about this:

s = 'I am student! Hello world&.~*~'
PunctuationToRemove = [".", ",", ":", ";", "!" ,"?", "&"]
stopWords = set([ "a", "i", "it", "am", "at", "on", "in", "to", "too", "very", \
                "of", "from", "here", "even", "the", "but", "and", "is", "my", \
                "them", "then", "this", "that", "than", "though", "so", "are" ])

# Remove specific punctuations
s_removed_punctuations = s.translate(None, ''.join(PunctuationToRemove))

# Convert input to lowercase
s_lower = s_removed_punctuations.lower()

# Remove stop words
s_result = ' '.join(word for word in s_lower.split() if word not in stopWords)

print(s_result)
#student hello world~*~
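
Note that the two-argument form s.translate(None, chars) only works on Python 2 byte strings. A minimal sketch of a wrapper that should behave the same on Python 3, where str.translate takes a single table built by str.maketrans, might look like this. The helper name strip_chars is an assumption, not something from the answer.

```python
# Hypothetical helper: delete every character in `chars` from `text`.
def strip_chars(text, chars):
    try:
        # Python 3: build a table that maps each character in `chars` to None.
        table = str.maketrans("", "", chars)
    except AttributeError:
        # Python 2: str has no maketrans; use the (None, deletechars) form instead.
        return text.translate(None, chars)
    return text.translate(table)

s = 'I am student! Hello world&.~*~'
print(strip_chars(s, '.,:;!?&'))
# I am student Hello world~*~
```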