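I have an assignment and I have no idea how to do it. This is the problem:
Write a program that counts and displays the number of words and the number of sentences in the following paragraph:
A Turing machine is a device that manipulates symbols on a strip of tape according to a table of rules. Despite its simplicity, a Turing machine can be adapted to simulate the logic of any computer algorithm, and is particularly useful in explaining the functions of a CPU inside a computer. The "Turing" machine was described by Alan Turing in 1936, who called it an "a(utomatic)-machine". The Turing machine is not intended as a practical computing technology, but rather as a hypothetical device representing a computing machine. Turing machines help computer scientists understand the limits of mechanical computation.
Here is what I have written so far: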
def main():
    pass  # body not written yet

def word_count(str):
    counts = dict()
    words = str.split()
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts
Thanks
Answer 0: (score: 1)
I think a better solution, without using the re module, is to implement the following functions:
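def getWordCount(text):
    # split() with no arguments splits on any run of whitespace
    return len(text.split())

def getSentenceCount(text):
    # every "." is taken to end a sentence
    return text.count(".")

# text is assumed to hold the paragraph from the question
print("Word count:", getWordCount(text), "\nSentence Count:", getSentenceCount(text))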
This prints:
Word count: 98
Sentence Count: 5
Note: for sentences, this assumes that the last sentence ends with a dot (.) and that dots are used only to end sentences.
An equivalent way to write it:
def getSentenceCount(text):
    return len(text.split(".")) - 1
To handle occurrences of !, ?, ; and ..., you should do something like this, repeating the replacement for each character you want to handle:
def getSentenceCount(text):
    # replace the multi-character terminators first so "?!" and "..." each count once
    st = text.replace("?!", ".").replace("...", ".").replace("?", ".").replace("!", ".").replace(";", ".")
    return st.count(".")
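The chained replacements can also be generalized without the re module. The following is only a minimal sketch, not part of the original answer: it assumes that any unbroken run of the characters ., ?, ! and ; ends exactly one sentence. The names TERMINATORS and count_sentences are made up for illustration.

```python
# Hypothetical helper: characters assumed to end a sentence.
TERMINATORS = ".?!;"

def count_sentences(text):
    """Count runs of terminator characters; "?!" and "..." each count once."""
    count = 0
    in_run = False  # True while we are inside a run of terminators
    for ch in text:
        if ch in TERMINATORS:
            if not in_run:
                count += 1  # first terminator of a new run
            in_run = True
        else:
            in_run = False
    return count

print(count_sentences("Really?! Yes... Fine."))
```

Like the version above, this still miscounts abbreviations such as "Dr." or decimal numbers, because it looks only at punctuation.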
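Hope this helps!
Answer 1: (score: 1)
Another solution, this one using regular expressions, which separates out punctuation and ignores case when counting words. I wasn't sure whether you wanted the total word count or the unique word count, so I did both...
I use the regular expression r"\w+" to find words, and collections.Counter to count them.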
import collections
import re
text = """A Turing machine is a device that manipulates symbols on a strip of tape according to a table of rules. Despite its simplicity, a Turing machine can be adapted to simulate the logic of any computer algorithm, and is particularly useful in explaining the functions of a CPU inside a computer. The "Turing" machine was described by Alan Turing in 1936, who called it an "a(utomatic)-machine". The Turing machine is not intended as a practical computing technology, but rather as a hypothetical device representing a computing machine. Turing machines help computer scientists understand the limits of mechanical computation."""
print("Number of words:", sum(1 for _ in re.finditer(r"\w+", text)))
unique_words = collections.Counter(match.group(0).lower() for match in re.finditer(r"\w+", text))
print("Number of unique words:", len(unique_words))
print("Unique words:", ', '.join(sorted(unique_words)))
print("Number of sentences:", sum(1 for _ in re.finditer(r"\.", text)))
Running it results in...
$ python3 test.py
Number of words: 100
Number of unique words: 63
Unique words: 1936, a, according, adapted, alan, algorithm, an, and, any, as, be, but, by, called, can, computation, computer, computing, cpu, described, despite, device, explaining, functions, help, hypothetical, in, inside, intended, is, it, its, limits, logic, machine, machines, manipulates, mechanical, not, of, on, particularly, practical, rather, representing, rules, scientists, simplicity, simulate, strip, symbols, table, tape, technology, that, the, to, turing, understand, useful, utomatic, was, who
Number of sentences: 5
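The total here is 100 rather than the 98 reported by the whitespace-splitting approach. The difference comes from the token "a(utomatic)-machine": str.split() keeps it as one word, while r"\w+" breaks it into three. A small sketch to illustrate (the function names split_tokens and regex_tokens are made up for this example):

```python
import re

def split_tokens(text):
    # Whitespace splitting: punctuation stays attached to the word.
    return text.split()

def regex_tokens(text):
    # \w+ matches runs of letters, digits and underscores only.
    return re.findall(r"\w+", text)

token = '"a(utomatic)-machine".'
print(split_tokens(token))  # ['"a(utomatic)-machine".']
print(regex_tokens(token))  # ['a', 'utomatic', 'machine']
```

Which count is "right" depends on how the assignment defines a word.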
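Answer 2: (score: 0)
Python has a library called "nltk" for processing natural language. It contains various functions for working with text, one of which is called "sent_tokenize". If you are allowed to use external libraries, it is easy to install.
Open cmd and run pip install nltk
After that, create a script with the following code and run it: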
import nltk
nltk.download()
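Just press download and wait for it to finish.
After that, you can use the nltk library for text/natural language processing.
Create a script with the following code and run it: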
import nltk
text = """A Turing machine is a device that manipulates symbols on a strip of tape according to a table of rules.
Despite its simplicity, a Turing machine can be adapted to simulate the logic of any computer algorithm, and is particularly useful in explaining the functions of a CPU inside a computer.
The "Turing" machine was described by Alan Turing in 1936, who called it an "a(utomatic)-machine".
The Turing machine is not intended as a practical computing technology, but rather as a hypothetical device representing a computing machine.
Turing machines help computer scientists understand the limits of mechanical computation.
"""
sentences = nltk.sent_tokenize(text) # this function will "tokenize" the text and pull all the sentences from it into a list
words = nltk.word_tokenize(text)
print("Number of sentences:", len(sentences))  # pass the int as a separate argument; str + int raises TypeError
print("Number of words:", len(words))
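Answer 3: (score: 0)
You seem to be looking for some kind of explanation, at least to some extent. Since you appear to be new to Python, I'll stick to the simpler possibilities.
Many Python programs have this basic structure. (Not all.)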
def main():
    # do something
    return

if __name__ == '__main__':
    main()
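I suspect you have only been taught part of this.
The main function can call other functions to carry out specific tasks. main can also accept input from its caller, which in this case is the paragraph to be evaluated. Those are the main things I wanted to mention.
In the end, though, the easiest way to identify and count words (in Python) is probably to split a string on its whitespace and note the length of the resulting list. No PhD necessary. This might stumble on 'a(utomatic)-machine'; then again, so would a human.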
def main(input):
    print('word count:', word_count(input))
    print('sentence count:', sentence_count(input))
    return

def word_count(str):
    a_count = len(str.split())
    return a_count

def sentence_count(str):
    # count the sentences
    a_count = None
    return a_count

if __name__ == '__main__':
    paragraph = '''A Turing machine is a device that manipulates symbols on a strip of tape according to a table of rules. Despite its simplicity, a Turing machine can be adapted to simulate the logic of any computer algorithm, and is particularly useful in explaining the functions of a CPU inside a computer. The "Turing" machine was described by Alan Turing in 1936, who called it an "a(utomatic)-machine". The Turing machine is not intended as a practical computing technology, but rather as a hypothetical device representing a computing machine. Turing machines help computer scientists understand the limits of mechanical computation.'''
    main(paragraph)
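The sentence_count function above is left as a placeholder. One way it might be filled in is sketched below; this is an assumption on my part rather than the answerer's code. It takes the simple rule that every ".", "?" or "!" ends exactly one sentence, and it renames the parameters to avoid shadowing the built-ins input and str.

```python
def word_count(text):
    # Split on any run of whitespace and count the pieces.
    return len(text.split())

def sentence_count(text):
    # Assumed rule: each ".", "?" or "!" ends exactly one sentence.
    return sum(text.count(mark) for mark in ".?!")

def main(paragraph):
    print("word count:", word_count(paragraph))
    print("sentence count:", sentence_count(paragraph))

if __name__ == "__main__":
    main("A Turing machine is a device. Is it simple? Yes!")
```

Under this rule an ellipsis ("...") would be counted as three sentences, so text that contains one needs extra handling.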
Answer 4: (score: 0)
The following Python script (let's call it ./counter.py) does the job:
#!/usr/bin/env python3
import fileinput
import re

words = 0
sents = 0
for line in fileinput.input():
    words += len(line.split())
    # count "!?" and "..." once each, otherwise any single ".", "?" or "!"
    sents += len(re.findall(r"!\?|\.\.\.|[.?!]", line))
print('Total words:', words)
print('Total sentences:', sents)
Assuming the paragraph is stored in the file ./test.txt:
monty:python%> cat ./test.txt | ./counter.py
Total words: 98
Total sentences: 5
monty:python%>
Answer 5: (score: 0)
Here is how I did it:
example_string = "This is an example string. it is cool."  # This would be your very long string of text.
words = example_string.split()  # splits on any run of whitespace, so no empty strings are produced
# split on "." and drop empty or whitespace-only pieces (lists have no trim() method)
sentences = [s.strip() for s in example_string.split(".") if s.strip()]
numOfSentences = len(sentences)  # length of the list (number of sentences as an int)
numOfWords = len(words)  # length of the list (number of words as an int)
print(numOfWords)
print(numOfSentences)
The output should be:
8
2
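One detail is worth spelling out: when the text ends with a period, splitting on "." leaves an empty string as the last piece, so the raw length of the split result is one more than the number of sentences. A short sketch (the helper name split_sentences is made up for illustration):

```python
def split_sentences(text):
    # Split on "." and keep only pieces that contain something besides whitespace.
    pieces = text.split(".")
    return [p.strip() for p in pieces if p.strip()]

example = "This is an example string. it is cool."
print(example.split("."))        # ['This is an example string', ' it is cool', '']
print(split_sentences(example))  # ['This is an example string', 'it is cool']
```

Filtering out the empty pieces is what makes the count come out as 2 rather than 3.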