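I want to split a text into sentences and then print the number of characters in each sentence, but my program does not count the characters per sentence correctly.
I am trying to tokenize a file supplied by the user into sentences, loop over the sentences, and print the number of characters in each one. The code I tried is: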
from collections import defaultdict
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize, wordpunct_tokenize
import re
import os
import sys
from pathlib import Path

while True:
    try:
        file_to_open = Path(input("\nYOU SELECTED OPTION 8: CALCULATE SENTENCE LENGTH. Please, insert your file path: "))
        with open(file_to_open, 'r', encoding="utf-8") as f:
            words = sent_tokenize(f.read())
            break
    except FileNotFoundError:
        print("\nFile not found. Better try again")
    except IsADirectoryError:
        print("\nIncorrect Directory path.Try again")

print('\n\n This file contains', len(words), 'sentences in total')

wordcounts = []
caracter_count = 0
sent_number = 1
with open(file_to_open) as f:
    text = f.read()
    sentences = sent_tokenize(text)
    for sentence in sentences:
        if sentence.isspace() != True:
            caracter_count = caracter_count + 1
            print("Sentence", sent_number, 'contains', caracter_count, 'characters')
            sent_number += 1
            caracter_count = caracter_count + 1
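I would like to print something like:
"Sentence 1 has 35 characters"
"Sentence 2 has 45 characters"
and so on.
The output I get from this program is:
This file contains 4 sentences in total
"Sentence 1 contains 0 characters"
"Sentence 2 contains 1 characters"
"Sentence 3 contains 2 characters"
"Sentence 4 contains 3 characters"
Can anyone help me with this?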
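Answer 0 (score: 0)
You are not counting the characters in each sentence with caracter_count; it is only incremented as the loop moves from sentence to sentence, so it tracks loop iterations rather than sentence length. I think changing your for loop to:
sentence_number = 1
for sentence in sentences:
    if not sentence.isspace():
        # len(sentence) is the number of characters in this sentence
        print("Sentence {} contains {} characters".format(sentence_number, len(sentence)))
        sentence_number += 1
should work.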
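As a minimal sketch of the same idea that runs without a file or NLTK, the loop can be wrapped in a small function. The helper name `report_lengths` and the sample sentences are illustrative assumptions, not part of the question; the hard-coded list stands in for whatever `sent_tokenize(text)` returns.

```python
def report_lengths(sentences):
    """Return one report line per non-blank sentence, using len() for the count."""
    lines = []
    # Skip whitespace-only entries, then number the remaining sentences from 1.
    kept = (s for s in sentences if not s.isspace())
    for number, sentence in enumerate(kept, start=1):
        lines.append(
            "Sentence {} contains {} characters".format(number, len(sentence))
        )
    return lines


# Hypothetical sample standing in for sent_tokenize(text).
sample = ["Hello there.", "How are you today?"]
for line in report_lengths(sample):
    print(line)
```

Note that `len()` counts every character in the sentence, including spaces and punctuation.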
Answer 1 (score: 0)
Your question is an interesting one, and it has a simple solution. Remember to run nltk.download('punkt') on the first run, then comment that line out afterwards.
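import nltk
# nltk.download('punkt')  # run once to fetch the tokenizer data, then comment out
from nltk.tokenize import sent_tokenize

def count_lines(file):
    # Read the whole file; the with block closes it automatically.
    with open(file, "r") as myfile:
        string = myfile.read()
    print(string)
    sentences = sent_tokenize(string)
    # Number the sentences from 1; len(w) is the character count of each one.
    for count, w in enumerate(sentences, start=1):
        print("Sentence", count, "has", len(w), "characters")

# Raw string so the backslashes in the Windows path are not treated as escapes.
count_lines(r"D:\Atharva\demo.txt")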
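If the `isspace()` check in the question was meant to leave whitespace out of the count (an assumption about the asker's intent, not something the post states), the test would need to be applied to each character rather than to the whole sentence. A rough sketch, with the hypothetical helper name `count_non_space`:

```python
def count_non_space(sentence):
    """Count the characters in a sentence, ignoring spaces, tabs and newlines."""
    return sum(1 for ch in sentence if not ch.isspace())


# Hypothetical sample sentence, not taken from the question.
sentence = "Hello there."
print("With whitespace:", len(sentence))
print("Without whitespace:", count_non_space(sentence))
```

Swapping `len(sentence)` for `count_non_space(sentence)` in either answer's loop would report the whitespace-free count instead.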