How to count the number of characters in each sentence of a text file?

Asked: 2019-07-09 13:27:36

Tags: python character nltk

I want to split a text into sentences and then print the number of characters in each sentence, but my program does not count the characters per sentence correctly.

I tried to tokenize a user-supplied file into sentences, loop over the sentences, and print the character count of each one. The code I tried is:

from collections import defaultdict
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize,wordpunct_tokenize
import re
import os
import sys
from pathlib import Path

while True:
    try:
        file_to_open = Path(input("\nYOU SELECTED OPTION 8: "
                                  "CALCULATE SENTENCE LENGTH. "
                                  "Please, insert your file path: "))
        with open(file_to_open, 'r', encoding="utf-8") as f:
            words = sent_tokenize(f.read())
            break
    except FileNotFoundError:
        print("\nFile not found. Better try again")
    except IsADirectoryError:
        print("\nIncorrect directory path. Try again")


print('\n\n This file contains',len(words),'sentences in total')



wordcounts = []
caracter_count=0
sent_number=1
with open(file_to_open) as f:
    text = f.read()
    sentences = sent_tokenize(text)
    for sentence in sentences:
        if sentence.isspace() !=True:
            caracter_count = caracter_count + 1
            print("Sentence", sent_number, 'contains', caracter_count, 'characters')
            sent_number +=1
            caracter_count = caracter_count + 1

I would like to print something like:

"Sentence 1 has 35 characters" "Sentence 2 has 45 characters"

and so on...

The output I actually get from this program is: This file contains 4 sentences in total "Sentence 1 contains 0 characters" "Sentence 2 contains 1 characters" "Sentence 3 contains 2 characters" "Sentence 4 contains 3 characters"

Can anyone help me get this working?

2 answers:

Answer 0: (score: 0)

You are not actually counting the characters of the sentence with caracter_count; you only increment it by 1 per sentence. I think changing your for loop to:

sentence_number = 1
for sentence in sentences:
    if not sentence.isspace():
        print("Sentence {} contains {} characters".format(sentence_number, len(sentence)))
        sentence_number += 1

should work fine.
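To see why len(sentence) is the right measure, here is a minimal, self-contained sketch. It uses a naive regex split as a stand-in for sent_tokenize so it runs without downloading the punkt model; the sample text is made up for illustration:

```python
import re

text = "Hello world. This is a test sentence! Short one?"

# Naive split after terminal punctuation; sent_tokenize is far more robust,
# but the counting logic afterwards is identical.
sentences = [s for s in re.split(r'(?<=[.!?])\s+', text) if s and not s.isspace()]

for number, sentence in enumerate(sentences, start=1):
    # len(sentence) counts every character, including spaces and punctuation.
    print("Sentence {} contains {} characters".format(number, len(sentence)))
```

enumerate(..., start=1) also removes the need for a manually maintained sentence counter.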

Answer 1: (score: 0)

Your question looks interesting, and it has a simple solution. Remember that on the first run you need nltk.download('punkt'); after that first run, comment the line out.



import nltk
#nltk.download('punkt')   # run once, then comment out
from nltk.tokenize import sent_tokenize

def count_lines(file):
    count = 0
    string = ""
    with open(file, "r") as myfile:
        for line in myfile:
            string += line

    sentences = sent_tokenize(string)

    for w in sentences:
        count += 1
        # len(w) is the number of characters in the sentence
        print("Sentence", count, "has", len(w), "characters")

count_lines(r"D:\Atharva\demo.txt")
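For what it's worth, the same counting logic can be packed into one reusable function with enumerate and a with block. This sketch again substitutes a naive regex splitter for sent_tokenize, so it is an approximation of the answer above rather than its exact behavior:

```python
import re

def sentence_lengths(text):
    """Return (sentence_number, character_count) pairs for non-blank sentences."""
    # Naive splitter on terminal punctuation; sent_tokenize is more robust
    # (it handles abbreviations like "Dr." correctly, this does not).
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    return [(i, len(s)) for i, s in enumerate(sentences, start=1)]

for number, length in sentence_lengths("First sentence. Second one!"):
    print("Sentence", number, "has", length, "characters")
```

Returning pairs instead of printing inside the function makes the result easy to test or feed into further processing.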