How to print the shortest and longest sentence of a text file using Python sent_tokenize?

Date: 2019-04-09 12:13:34

Tags: python nltk

I have a program that:

a) counts and displays the number of tokens in each sentence of a text file chosen by the user
b) displays the sentence numbers: Sentence 1, Sentence 2, ...
c) displays the token length of each sentence

Problem: I also want to display the longest and the shortest sentence of the file, but my program does not work out the sentence with the largest number of tokens or the sentence with the smallest number of tokens. I get no error message, but the output I get is:

The longest sentence of this file contains 1 tokens

The shortest sentence of this file contains 1 tokens

The mean sentence length of this file is: 56.55384615384615

I tried using the max() and min() functions for this. My code is below.

from pathlib import Path
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize

def sent_length():
    while True:
        try:
            file_to_open = Path(input("\nYOU CHOSE OPTION 1. Please, insert your file path: "))
            # opens the file and tokenizes it into sentences
            with open(file_to_open) as f:
                words = sent_tokenize(f.read())
                break
        except FileNotFoundError:
            print("\nFile not found. Better try again")
        except IsADirectoryError:
            print("\nIncorrect directory path. Try again")
    print('\n\n This file contains', len(words), 'sentences in total')

    sent_number = 1

    for t in words:
        a = word_tokenize(t)  # tokenizes the sentence into words
        # displays the sentence number and the sentence length
        print('\n\nSentence', sent_number, 'contains', len(a), 'tokens')
        sent_number += 1

    wordcounts = []

    with open(file_to_open) as f:
        text = f.read()
        sentences = sent_tokenize(text)
        for sentence in sentences:
            words = word_tokenize(sentence)
            wordcounts.append(len(words))  # appends the length of each sentence to a list
    # calculates the mean sentence length
    average_wordcount = sum(wordcounts) / len(wordcounts)

    # loops through the sentences of the file and tokenizes each one
    for x in words:
        tokenized_sentences = wordpunct_tokenize(x)

    longest_sen = max(tokenized_sentences, key=len)   # gets the maximum number
    longest_sen_len = len(longest_sen)
    shortest_sen = min(tokenized_sentences, key=len)  # gets the minimum number
    shortest_sen_len = len(shortest_sen)

    print('\n\n The longest sentence of this file contains', longest_sen_len, 'tokens')
    print('\n\n The shortest sentence of this file contains', shortest_sen_len, 'tokens')
    print('\n\nThe mean sentence length of this file is: ', average_wordcount)
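
For reference, the wordcounts list built above already holds one token count per sentence, so the intended max/min logic could read straight from it; a minimal sketch, reusing the same variables:

# sketch: wordcounts already stores the token count of every sentence
longest_sen_len = max(wordcounts)    # token count of the longest sentence
shortest_sen_len = min(wordcounts)   # token count of the shortest sentence

The final for loop, by contrast, rebinds tokenized_sentences on every pass, so max() and min() only ever see the wordpunct tokens of the last word of the last sentence, typically a single punctuation mark, which is why both report 1 token.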

My expected result would be output like this:

For example:

The longest sentence of this file contains 70 tokens

The shortest sentence of this file contains 10 tokens

The mean sentence length of this file is: 56.55384615384615

1 Answer:

Answer 0 (score: 1)

This may not be the best way to do it, but it might help.

from nltk.tokenize import sent_tokenize
from statistics import mean

EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."

# split the text into sentences
tokened_sent = sent_tokenize(EXAMPLE_TEXT)

# map each sentence to its number of space-separated tokens
main_dict = {}

for item in tokened_sent:
    main_dict[item] = len(item.split(" "))

print('Maximum Value: ', max(main_dict.values()))
print('Minimum Value: ', min(main_dict.values()))
print('Average Value: ', mean(main_dict.values()))
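
The dictionary maps each sentence string to its token count, so the built-in max() and min() with a key function can recover the sentences themselves rather than just the counts. A small sketch building on the answer's main_dict (note that duplicate sentences would overwrite each other as dictionary keys):

# sketch: recover the sentences themselves, not just their lengths
longest = max(main_dict, key=main_dict.get)   # sentence with the most tokens
shortest = min(main_dict, key=main_dict.get)  # sentence with the fewest tokens
print('Longest sentence (', main_dict[longest], 'tokens): ', longest)
print('Shortest sentence (', main_dict[shortest], 'tokens): ', shortest)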