Question

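In Python I am trying to take in a text file. While scanning through each character, when I find a capital letter, I want to keep track of the number of characters that follow until I find a '?', '!', or '.'. Basically, I am reading in large text files and trying to count the sentences and the total characters so I can find the average sentence length. (I know there will be some errors with things such as "Mr." or "E.g.", but I can live with them. The data set is so large that the error will be negligible.)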

import sys

char = ''
for line in sys.stdin:
  words = line
  for char in words:
    if char.isupper():
      # read each char until you see a ?, !, or . and keep track
      # of the number of characters in the sentence.
      pass  # this is the part I don't know how to write

Answer 1

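You may want to use the nltk module to tokenize sentences instead of reinventing the wheel. It handles all sorts of corner cases, such as parentheses and other odd sentence structures.

It has a sentence tokenizer, nltk.sent_tokenize. Note that before using it you have to download the English model with nltk.download().

Here is how you could solve your problem with nltk: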

import sys
import nltk

sentences = nltk.sent_tokenize(sys.stdin.read())

# average sentence length in characters
print(sum(len(s) for s in sentences) / len(sentences))

Answer 2

Use this solution if you want to go through stdin line by line, as your current code does. It counts the breaks with a two-state machine.

import sys

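in_a_sentence = False  # state: inside a sentence, or between sentences
count = 0              # characters counted so far in the current sentence
lengths = []

for line in sys.stdin:
    for char in line:
        if char.isupper():
            in_a_sentence = True
        elif in_a_sentence and char in '.?!':
            # The guard skips stray punctuation outside a sentence (e.g. "...").
            # +1 counts the closing mark itself.
            lengths.append(count + 1)
            in_a_sentence = False
            count = 0

        if in_a_sentence:
            count += 1

print(lengths)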

Output:

mbp:scratch geo$ python ./count.py
This is a test of the counter. This test includes
line breaks. See? Pretty awesome,
huh!
^D[30, 31, 4, 20]
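Since the question asks for the average rather than the individual lengths, here is a minimal sketch of the same two-state idea that keeps only running totals, which may suit very large inputs better than a list. The function name sentence_stats and the sample list are made up for illustration; punctuation seen outside a sentence is ignored.

```python
def sentence_stats(lines):
    """Return (sentence_count, total_chars, average_length) for an iterable of lines."""
    in_a_sentence = False
    count = 0       # characters in the current sentence so far
    sentences = 0   # completed sentences
    total = 0       # characters across all completed sentences

    for line in lines:
        for char in line:
            if in_a_sentence:
                count += 1
                if char in '.?!':
                    # Closing mark: record the sentence and reset the state.
                    sentences += 1
                    total += count
                    in_a_sentence = False
                    count = 0
            elif char.isupper():
                # A capital letter outside a sentence starts a new one.
                in_a_sentence = True
                count = 1

    average = total / sentences if sentences else 0.0
    return sentences, total, average


sample = [
    "This is a test of the counter. This test includes\n",
    "line breaks. See? Pretty awesome,\n",
    "huh!\n",
]
print(sentence_stats(sample))  # (4, 85, 21.25)
```

To run it on real input, pass sys.stdin (or an open file object) instead of sample, since both yield lines one at a time.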

But if you are able to read the whole thing into a variable at once, you could do something more like this:

import re
import sys

data = sys.stdin.read()
lengths = [len(x) for x in re.findall(r'[A-Z][^.?!]*[.?!]', data)]

print(lengths)

That gives the same result for the sample input above.
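To get from the list of lengths to the average the question asks for, something along these lines should work. It is only a sketch: the helper name average_sentence_length is hypothetical, and it returns 0.0 when no sentence is found so that it never divides by zero.

```python
import re

# A capital letter, then anything that is not closing punctuation
# (newlines included), then one closing mark.
SENTENCE = re.compile(r'[A-Z][^.?!]*[.?!]')


def average_sentence_length(text):
    """Average length, in characters, of the regex-matched sentences in text."""
    lengths = [len(match.group()) for match in SENTENCE.finditer(text)]
    return sum(lengths) / len(lengths) if lengths else 0.0


text = (
    "This is a test of the counter. This test includes\n"
    "line breaks. See? Pretty awesome,\n"
    "huh!\n"
)
print(average_sentence_length(text))  # 21.25
```

With real input, replace text with sys.stdin.read(), keeping in mind that this loads the whole input into memory.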

How do I count the characters after a condition?
