Python: finding the longest/shortest sentence in a random paragraph?

Time: 2014-12-26 02:02:14

Tags: python python-2.7

I'm using Python 2.7 and need two functions that find the longest and the shortest sentence (in terms of word count) in a random paragraph. For example, if I pass in this paragraph:

"Pair your seaside escape with the reds and whites of northern California's wine country in Jenner. This small coastal city in Sonoma County sits near the mouth of the Russian River, where, all summer long, harbor seals and barking California sea lions heave themselves onto the sand spit, sunning themselves for hours. You can swim and hike at Fort Ross State Historic Park and learn about early Russian hunters who were drawn to the area's herds of seal for their fur pelts. The fort's vineyard, with vines dating back to 1817, was one of the first places in California where grapes were planted."

The output should be 36 16, where 36 means the longest sentence has 36 words and 16 means the shortest sentence has 16 words.

3 answers:

Answer 0 (score: 4):

You need a way to split the paragraph into sentences and to count the words in a sentence. You can use the nltk package for both:

from nltk.tokenize import sent_tokenize, word_tokenize # $ pip install nltk

sentences = sent_tokenize(paragraph)
word_count = lambda sentence: len(word_tokenize(sentence))
print(min(sentences, key=word_count)) # the shortest sentence by word count
print(max(sentences, key=word_count)) # the longest sentence by word count
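
If you want the two counts themselves rather than the sentences, the same word_count helper can be reused. A minimal sketch, not part of the original answer; note that word_tokenize also counts punctuation as tokens, so the numbers can differ slightly from a plain whitespace split, and the tokenizers need NLTK's punkt models (nltk.download('punkt')) the first time they are used:

counts = [word_count(sentence) for sentence in sentences]
print(max(counts))  # word count of the longest sentence
print(min(counts))  # word count of the shortest sentence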

Answer 1 (score: 3):

def MaxMinWords(paragraph):
    numWords = [len(sentence.split()) for sentence in paragraph.split('.')]
    return max(numWords), min(numWords)

Edit: As many have pointed out in the comments, this solution is far from robust; the snippet is only meant to serve as a pointer for the OP.
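
For reference, a minimal usage sketch, not part of the original answer; it assumes the example paragraph is stored in a variable par as in the answer below. Because the paragraph ends with a period, paragraph.split('.') leaves a trailing empty string that counts as 0 words, so empty pieces should be filtered out before taking the minimum:

print MaxMinWords(par)  # (36, 0): the trailing '.' leaves an empty piece counted as 0 words

# filtering out empty pieces gives the expected counts
num_words = [len(s.split()) for s in par.split('.') if s.strip()]
print max(num_words), min(num_words)  # 36 16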

Answer 2 (score: 1):

Edit: As mentioned in the comments below, programmatically determining what constitutes a sentence in a paragraph is a very complex task. However, based on the example you provided, I've sketched out a good starting point below that may solve your problem.

First, we want to tokenize the paragraph into sentences. We do this by splitting the text at each occurrence of a period (in the code below, on ". ", a period followed by a space). This returns a list of strings, each of which is one sentence.

Then we want to split each sentence into its corresponding list of words. With this list of lists, we take the sentence (represented as a list of words) of maximum length and the one of minimum length. Consider the following code:

par = "Pair your seaside escape with the reds and whites of northern California's wine country in Jenner. This small coastal city in Sonoma County sits near the mouth of the Russian River, where, all summer long, harbor seals and barking California sea lions heave themselves onto the sand spit, sunning themselves for hours. You can swim and hike at Fort Ross State Historic Park and learn about early Russian hunters who were drawn to the area's herds of seal for their fur pelts. The fort's vineyard, with vines dating back to 1817, was one of the first places in California where grapes were planted."

# split paragraph into sentences
sentences = par.split(". ")

# split each sentence into words
tokenized_sentences = [sentence.split(" ") for sentence in sentences]

# get longest sentence and its length
longest_sen = max(tokenized_sentences, key=len)
longest_sen_len = len(longest_sen)

# get shortest sentence and its length
shortest_sen = min(tokenized_sentences, key=len)
shortest_sen_len = len(shortest_sen)

print longest_sen_len
print shortest_sen_len
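
For the example paragraph this prints 36 and then 16, matching the expected output in the question.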