Counting and averaging words in sentences

Time: 2017-03-06 22:47:47

Tags: python tokenize lexical-analysis

I have to use Python to print the number of words and the mean word length for each sentence of a text file. I am not allowed to use NLTK or regex for this task.

Sentences in the file end with a period, exclamation mark, or question mark. Hyphens, dashes, and apostrophes do not end a sentence, and neither do quotation marks. However, some periods do not end a sentence: Mrs., Mr., Dr., Fr., Jr., and St. are all common abbreviations.

For example, the output should be a list of one tuple per sentence:

[(no. of words, mean length of words in sentence1),
(no. of words, mean length of words in sentence2),
...]

My code:

p = ("Mrs.", "Mr.", "St.")

def punct_after_ab(texts):
    # Drop the trailing period from the known abbreviations
    new_text = texts
    for abb in p:
        new_text = new_text.replace(abb, abb[:-1])
    return print(new_text)

import numpy

def word_list(text):
    # Remove apostrophes and commas, then collect the word lengths
    special_characters = ["'", ","]
    clean_text = text
    for string in special_characters:
        clean_text = clean_text.replace(string, "")
    count_list = [len(i) for i in clean_text.split()]
    count = [numpy.mean(count_list)]
    return print((count_list), (count))

But when I test it, it does not split the text into sentences.

1 Answer:

Answer 0 (score: 0):

Use something along the lines of .split(' ') to separate the words (they are space-separated in the case you describe), then use list operations and basic math/statistics to get your answers. If you update your question to be more specific and include some of your own code, I am happy to revise my answer accordingly.
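As a rough illustration of that idea (the sentence string and variable names here are made up for the example):

sentence = "My name is Bob"            # an already-isolated sentence

words = sentence.split(' ')            # split on spaces to get the words
word_count = len(words)                # number of words in the sentence
mean_length = sum(len(w) for w in words) / word_count   # average word length

print(word_count, mean_length)         # prints: 4 2.75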

You will find that on this site, if you do not put much effort into the question you ask, you will not get very helpful answers. Try to do some research and write as much code as you can before asking a question. That makes it easier for people to help you, and they will be more willing to. As it stands, it looks like you just want someone to do your homework for you.

Update:

Most of your code works; only a few things need to change. I played around with what you had and was able to break the text into an array of sentences, from which you can go on to compute the statistics.

input.txt:

My name? Mr. Bob. Your name? Mrs. Lily!
What's up?

test.py (I am using Python 3.6):

def punct_after_ab(texts):
    p = ("Mrs.", "Mr.", "St.")
    new_text = texts
    for abb in p:
        new_text = new_text.replace(abb, abb[:-1])
    return new_text


def clean_text(text):
    special_characters = ["'", ","]
    clean_text = text
    for string in special_characters:
        clean_text = clean_text.replace(string, "")
    return clean_text


def split_sentence(text):
    # Initialize vars
    sentences = []
    start = 0
    i = 0

    # Loop through the text until you find punctuation,
    # then add the sentence to the final array
    for char in text:
        if char == '.':
            sentences.append(text[start:i+1])
            start = i + 2
        if char == '?':
            sentences.append(text[start:i+1])
            start = i + 2
        if char == '!':
            sentences.append(text[start:i+1])
            start = i + 2
        i += 1

    # Print the sentences to console
    for sentence in sentences:
        print(sentence)


def main():
    # Ask user for file name
    file = input("Enter file name: ")
    # Open the file and strip newline chars
    fd = open(file).read()
    fd = fd.strip("\n")

    # Remove punctuation that doesn't delineate sentences
    text = punct_after_ab(fd)
    text = clean_text(text)

    # Separate sentences
    split_sentence(text)

# Run program
if __name__ == '__main__':
    main()

I was able to get the following output:

Enter file name: input.txt
My name?
Mr Bob.
Your name?
Mrs Lily!
Whats up?

Process finished with exit code 0

From there you can easily compute the sentence statistics. I just typed this up quickly, so you may want to go through it and clean it up a bit. I hope this helps.
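For what it's worth, the statistics step could look something like the sketch below. sentence_stats is just a hypothetical helper name, and it assumes split_sentence is changed to return the sentences list instead of only printing it:

def sentence_stats(sentences):
    # Build the requested (word count, mean word length) tuple for each sentence
    stats = []
    for sentence in sentences:
        words = sentence.strip('.!?').split()      # drop end punctuation, split on whitespace
        lengths = [len(word) for word in words]
        stats.append((len(words), sum(lengths) / len(lengths)))
    return stats

# Usage, assuming split_sentence now returns the list it builds:
# sentences = split_sentence(text)
# print(sentence_stats(sentences))
# which for the input.txt above would give roughly:
# [(2, 3.0), (2, 2.5), (2, 4.0), (2, 3.5), (2, 3.5)]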