Question

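I have a dataset containing a collection of text messages. I want to compute the average number of words per sentence, but every message is formatted differently: some end with a full stop and some do not.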

For example, the messages:

          Tiwary to rcb.battle between bang and kochi
          Dhawan for dc:)
          Warner to delhi.
          make it fast...

Using:

   words = messages.split()  # get each word in the sentence
   leg_wrd = len(words)

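But finding the end of a sentence is a problem, because the endings are not consistent. So how can I identify where a sentence ends, and how can I compute this in Python 2.7?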

Answer 1

This is not a trivial problem. I suggest using a third-party library such as NLTK. It has a sentence tokenizer that works like this:

# Make sure that you have NLTK installed (sent_tokenize also needs the Punkt tokenizer data)
from nltk.tokenize import sent_tokenize
text = "this's a sent tokenize test. this is sent two. is this sent three? sent 4 is cool! Now it's your turn."

sent_tokenize_list = sent_tokenize(text)

print(sent_tokenize_list)
# Will output ["this's a sent tokenize test.", 'this is sent two.', 'is this sent three?', 'sent 4 is cool!', "Now it's your turn."]
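
Once you have sentences, the average is just total words divided by total sentences. Here is a rough sketch of that step using only the standard library, so it runs without NLTK. Note the assumptions: it treats any run of '.', '!' or '?' as a sentence end (so "rcb.battle" becomes two sentences, but "3.5" or "e.g." would be split as well), and it counts the end of each message as a sentence end. The names `split_sentences` and `average_words_per_sentence` are made up for this illustration. You could replace `split_sentences` with `sent_tokenize` if you prefer NLTK's tokenizer.

```python
import re

# Assumption: a sentence ends at a run of '.', '!' or '?',
# or at the end of the message if no such mark is present.
SENTENCE_END = re.compile(r"[.!?]+")


def split_sentences(message):
    """Split one message into non-empty, stripped sentence strings."""
    parts = SENTENCE_END.split(message)
    return [p.strip() for p in parts if p.strip()]


def average_words_per_sentence(messages):
    """Return total words divided by total sentences over all messages."""
    sentences = [s for m in messages for s in split_sentences(m)]
    if not sentences:
        return 0.0
    total_words = sum(len(s.split()) for s in sentences)
    # float() keeps true division under Python 2.7 as well
    return float(total_words) / len(sentences)


messages = [
    "Tiwary to rcb.battle between bang and kochi",
    "Dhawan for dc:)",
    "Warner to delhi.",
    "make it fast...",
]

print(average_words_per_sentence(messages))
```

With the four sample messages from the question this finds 5 sentences and 17 words. The emoticon in "Dhawan for dc:)" counts as part of the word "dc:)", which is probably acceptable for a word count but worth knowing about.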
