Suggested code improvements: counting word occurrences while ignoring punctuation and common words

Time: 2017-04-17 23:39:59

Tags: python list file dictionary tuples

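The point of this program is to count how often each word occurs while ignoring punctuation, articles, and conjunctions. The desired output is a list of the 15 most-used words and a list of the 15 least-used words, without showing their counts. I'm a beginner and would greatly appreciate any help. Thanks!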

# This program reads a text file, performs a content analysis
# and prints both a top 15 and a bottom 15 report

name = input('Enter name of file: ')

 # Clean Function
def clean(s):
    punctuations = ["!","@","#","$"]
    art_con = ['the','a','an','some','and','but','or','nor','for']
    for each in punctuations:
        s = s.replace(each,"")
    words = s.split()
    resultwords = [word for word in words if word.lower() not in art_con]
    result= ''.join(resultwords)
    return result

# Analyze Function
def analyze(name):
    print('Reading',name,'for analysis...')
    print('===========================')
    print('Creating content analysis dictionary...')
    r = open(name, 'r')
    s = r.read()
    result = clean(s)
    count = dict((x,result.count(x)) for x in set(result))
    print('Analysis complete!')
    print('===================')
    return count

count = analyze(name)

# turn dictionary into a list of tuples to sort
def function(count):
    list1 = []
    for key in count:
        t = (count[key],key)
        list1.append((t))
    list1.sort()
    result = [list1[i] for i in range(len(list1))]
    t15 = result[0:15]
    b15 = result[-15:0]
    print("The top 15 words are ",t15)
    print("The bottom 15 words are ",b15)

#Main Function
def main():
    count = analyze(name)
    function(count)
main() 

1 answer:

Answer 0: (score: 0)

  

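I'm a beginner and would greatly appreciate any help.

Overall, the code looks fine. The clean() function could be made faster and more concise by: 1) lowercasing the input string at the start, 2) using a regular expression to extract the words while ignoring punctuation, and 3) removing the common words with a set-difference operation.

Here's a rough approach to get you started: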

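import re
def clean(s):
    words = re.findall(r"[a-z'\-]+", s.lower())  # lowercase first, then keep letters, apostrophes, hyphens
    return set(words) - {'the','a','an','some','and','but','or','nor','for'}  # unique words only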
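
One caveat with the snippet above: a set difference keeps only the unique words, so the occurrence counts are lost. The following is a minimal sketch of one way to carry the idea through to the top/bottom report, assuming it is acceptable to filter a list instead of a set and to count with collections.Counter. The names COMMON_WORDS and top_and_bottom and the sample text are illustrative and do not come from the original post. It ranks words from most to least frequent and takes the last n entries with ranked[-n:], because a slice such as result[-15:0] in the question is always empty.

```python
import re
from collections import Counter

# Articles and conjunctions to skip, copied from the question's art_con list.
COMMON_WORDS = {'the', 'a', 'an', 'some', 'and', 'but', 'or', 'nor', 'for'}

def clean(s):
    """Return the words in s as a lowercased list, minus the common words."""
    words = re.findall(r"[a-z'\-]+", s.lower())
    # Filter a list rather than taking a set difference, so repeats survive.
    return [word for word in words if word not in COMMON_WORDS]

def top_and_bottom(words, n=15):
    """Return the n most frequent and n least frequent words, without counts."""
    # most_common() with no argument lists every word, highest count first.
    ranked = [word for word, _ in Counter(words).most_common()]
    return ranked[:n], ranked[-n:]

# Small illustrative input (hypothetical); a real run would read the file instead.
text = "The cat and the dog! The cat ran, but the dog sat. A cat naps."
top, bottom = top_and_bottom(clean(text), n=2)
print("The top 2 words are", top)
print("The bottom 2 words are", bottom)
```

To use it on a file, read the contents with something like `with open(name) as f: text = f.read()` and pass `text` to clean(). Words with equal counts have no meaningful order among themselves, so the bottom list can vary in which tied words it shows.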