How to split a paragraph containing a numbered list into sentences with Python?

Asked: 2018-10-17 08:40:39

Tags: python nlp nltk

I intend to split a paragraph into multiple sentences. The paragraph contains numbered sentences, like the following:


I split it into sentences with the following code:

Hello, How are you? Hope everything is good. I'm fine. 1.Hello World. 2.Good Morning John. 

Product is good but the managemnt is very lazy very bad. I dont like company service. They are giving fake promises. Next time i will not take any product. For Amazon service i will give 5 star dey give awsome service. But for sony company i will give 0 star... 1. Doesn't support all file formats when you connect USB 2. No other apps than YouTube and Netflix (requires subscription) 3. Screen mirroring is not up to the mark ( getting connected after once in 10 attempts 4. Good screen quality 5. Audio is very good 6. Bulky compared to other similar range 7. Price bit high due to brand value 8. its 1/4 smart TV. Not a full smart TV 9. Bad customer support 10. Remote control is very horrible to operate. it might be good for non smart TV 11. See the exchange value on amazon itself. LG gets 2ooo/- more than TV's 12. Also it was mentioned like 1+1 year warranty. But either support or Amazon support aren't clear about it. 13. Product information isn't up to 30% at least.There no installation. While I contact costumer Care.

This code splits the paragraph on full stops. But the numbered sentences cause a problem: since the numbers are followed by periods, the text is split incorrectly.
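The question does not show the actual splitting code; a naive split on every period presumably behaves like the sketch below (a hypothetical reconstruction, not the original code), which demonstrates why the numbered items break:

```python
# Hypothetical reconstruction of a naive period-based splitter;
# the question's actual code is not shown in the post.
text = "Hope everything is good. I'm fine. 1.Hello World. 2.Good Morning John."

sentences = [s.strip() for s in text.split(".") if s.strip()]
print(sentences)
# The list markers are torn off: "1" and "Hello World" come out
# as separate "sentences" instead of one item.
```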

Can anyone advise me?

2 answers:

Answer 0 (score: 0)

You need sent_tokenize:

from nltk.tokenize import sent_tokenize

text = "Hello, How are you? Hope everything is good. I'm fine. 1.Hello World. 2.Good Morning John."

print(sent_tokenize(text))

Output:

['Hello, How are you?', 'Hope everything is good.', "I'm fine.", '1.Hello World.', '2.Good Morning John.']

Answer 1 (score: 0)

@AkshayNevrekar @fervent sent_tokenize uses PunktSentenceTokenizer by default, so you should both get the same result. https://www.nltk.org/api/nltk.tokenize.html

  

nltk.tokenize.sent_tokenize(text, language='english') [source]
    Return a sentence-tokenized copy of text, using NLTK's recommended sentence tokenizer (currently PunktSentenceTokenizer for the specified language).

Maybe the two of you have different NLTK versions?

According to https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktSentenceTokenizer:

  

This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

     

The NLTK data package includes a pre-trained Punkt tokenizer for English.

This module uses a machine-learning algorithm to segment text, and you are using an already-trained tokenizer. If you are not satisfied with its results, you need to train the tokenizer yourself on a collection of texts similar to the ones you want to split. Splitting text into sentences is not trivial, and you will probably never be 100% satisfied with such an algorithm. You have to accept some errors, because its behavior is hard to predict.
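If you do want to train Punkt yourself, the sketch below shows the general shape (assuming nltk is installed; `corpus` here is a toy placeholder, whereas real training needs a large plain-text collection from your domain):

```python
# Sketch: training a Punkt tokenizer on your own corpus.
# `corpus` is a toy placeholder; real training needs a large plain-text corpus.
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

corpus = "Dr. Smith bought a TV. The TV works well. Mr. Jones agrees with Dr. Smith."

trainer = PunktTrainer()
trainer.train(corpus, finalize=False)  # can be called repeatedly on more text
trainer.finalize_training()

tokenizer = PunktSentenceTokenizer(trainer.get_params())
sents = tokenizer.tokenize("I met Dr. Smith. He was fine.")
print(sents)
```

With a corpus this small the learned abbreviation model is unreliable; the point is only the training workflow, not the output quality.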

You can also try implementing your own algorithm based on rules you define. As an example (not perfect, but it yields the number of sentences you expect):

import re
text = "Hello, How are you? Hope everything is good. I'm fine. 1.Hello World. 2.Good Morning John. Product is good but the managemnt is very lazy very bad. I dont like company service. They are giving fake promises. Next time i will not take any product. For Amazon service i will give 5 star dey give awsome service. But for sony company i will give 0 star... 1. Doesn't support all file formats when you connect USB 2. No other apps than YouTube and Netflix (requires subscription) 3. Screen mirroring is not up to the mark ( getting connected after once in 10 attempts 4. Good screen quality 5. Audio is very good 6. Bulky compared to other similar range 7. Price bit high due to brand value 8. its 1/4 smart TV. Not a full smart TV 9. Bad customer support 10. Remote control is very horrible to operate. it might be good for non smart TV 11. See the exchange value on amazon itself. LG gets 2ooo/- more than TV's 12. Also it was mentioned like 1+1 year warranty. But either support or Amazon support aren't clear about it. 13. Product information isn't up to 30% at least.There no installation. While I contact costumer Care."
# Each match lazily consumes text containing at least one lowercase letter,
# ending at a digit or lowercase letter followed by ?, . or !
# (raw string avoids invalid escape-sequence warnings).
print(list(re.findall(r'.*?[a-z].*?[0-9a-z][?.!]+', text)))

With this kind of algorithm it is easier to get predictable results, but it will not work well on unexpected text, because it is hard to find rules that apply to every sentence.
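A middle ground, sketched below with only the standard library, is to encode one targeted rule: split before a list marker (one or two digits followed by a period and whitespace), so the number stays attached to its own item. Note that `re.split` on a zero-width lookahead requires Python 3.7+:

```python
import re

text = ("But for sony company i will give 0 star... 1. Doesn't support all "
        "file formats when you connect USB 2. No other apps than YouTube "
        "and Netflix (requires subscription) 3. Screen mirroring is not up "
        "to the mark")

# Split *before* each "N. " list marker (zero-width lookahead, Python 3.7+),
# so the number stays attached to its item instead of ending the previous one.
items = [p.strip() for p in re.split(r'(?=\b\d{1,2}\.\s)', text) if p.strip()]
print(items)
```

This only handles the list markers; each resulting item would still need ordinary sentence splitting applied to it.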

To help you choose a solution:

  • You know your input: write your own rule-based algorithm and keep adding rules until you are satisfied with the results.

  • You will get unexpected input: the NLTK algorithm will probably do better, but you cannot be sure how it will split the text.