包含引号的文本的句子标记化

时间:2015-08-14 06:04:00

标签: python nlp nltk tokenize

代码:

[After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.',
 'Finally they pushed you out of the cold emergency room.',
 'I failed to protect you.',
 '"Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.',]

输出:

" Li.

输入:

  在杜死于窒息之后,她的男朋友发布了令人心碎的事   在线留言:“失去意识在我的怀里,你的呼吸和   心跳变得越来越弱。最后他们把你赶出了   冷急诊室。我没有保护你。“

     李娜,23岁,来自江西省农民家庭的农民工,   期待在2015年结婚。

行情应包括在前一句中。而不是."

它在html = open(path, "r").read() #reads html code article = extractor.extract(raw_html=html) #extracts content text = unidecode(article.cleaned_text) #changes encoding 失败了如何解决这个问题?

修改 解释文本的提取。

['After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.',
 'Finally they pushed you out of the cold emergency room.',
 'I failed to protect you.',
 '"',
 'Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.'
]

这里,article.cleaned_text是unicode。使用它来改变字符“to”的想法。

解决方案@alvas错误结果:

python -c "import nltk; print nltk.__version__"
3.0.4
python -V
Python 2.7.9

EDIT2: (更新)nltk和python版本

Thread.sleep()

1 个答案:

答案 0 :(得分:4)

我不确定所需的输出是什么,但我认为在nltk.sent_tokenize之前你可能需要一些段落分段,即:

>>> text = """After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
... 
... Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015."""
>>> from nltk import sent_tokenize
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...     for sent in sent_tokenize(pg):
...             print sent
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.
Finally they pushed you out of the cold emergency room.
I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

可能您也可能需要strings within the double quotes,如果是这样,您可以尝试这样做:

>>> import re
>>> str_in_doublequotes = r'"([^"]*)"'
>>> re.findall(str_in_doublequotes, text)
['Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you.']

或者你可能需要这个:

>>> for pg in paragraphs:
...     # Collects the quotes inside the paragraph 
...     in_quotes = re.findall(str_in_doublequotes, pg)
...     for q in in_quotes:
...             # Keep track of the quotes with tabs.
...             pg = pg.replace('"{}"'.format(q), '\t')
...     for _pg in pg.split('\t'):
...             for sent in sent_tokenize(_pg):
...                     print sent
...             try:
...                     print '"{}"'.format(in_quotes.pop(0))
...             except IndexError: # Nothing to pop.
...                     pass
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: 
"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

从文件中读取时,请尝试使用io包:

alvas@ubi:~$ echo -e """After Du died of suffocation, her boyfriend posted a heartbreaking message online: \"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you.\"\n\nLi Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.""" > in.txt
alvas@ubi:~$ cat in.txt 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
alvas@ubi:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> from nltk import sent_tokenize
>>> text = io.open('in.txt', 'r', encoding='utf8').read()
>>> print text
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

>>> for sent in sent_tokenize(text):
...     print sent
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.
Finally they pushed you out of the cold emergency room.
I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

用段落和引用提取黑客:

>>> import io, re
>>> from nltk import sent_tokenize
>>> str_in_doublequotes = r'"([^"]*)"'
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...     # Collects the quotes inside the paragraph 
...     in_quotes = re.findall(str_in_doublequotes, pg)
...     for q in in_quotes:
...             # Keep track of the quotes with tabs.
...             pg = pg.replace('"{}"'.format(q), '\t')
...     for _pg in pg.split('\t'):
...             for sent in sent_tokenize(_pg):
...                     print sent
...             try:
...                     print '"{}"'.format(in_quotes.pop(0))
...             except IndexError: # Nothing to pop.
...                     pass
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: 
"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

对于将引用前句与引号连接起来的魔法(不要眨眼,它看起来与上面完全相同):

>>> import io, re
>>> from nltk import sent_tokenize
>>> str_in_doublequotes = r'"([^"]*)"'
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...     # Collects the quotes inside the paragraph 
...     in_quotes = re.findall(str_in_doublequotes, pg)
...     for q in in_quotes:
...             # Keep track of the quotes with tabs.
...             pg = pg.replace('"{}"'.format(q), '\t')
...     for _pg in pg.split('\t'):
...             for sent in sent_tokenize(_pg):
...                     print sent,
...             try:
...                     print '"{}"'.format(in_quotes.pop(0))
...             except IndexError: # Nothing to pop.
...                     pass
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online:  "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

上述代码的问题在于它仅限于以下句子:

  在杜死于窒息之后,她的男朋友发布了令人心碎的事   在线留言:“失去意识在我的怀里,你的呼吸和   心跳变得越来越弱。最后他们把你赶出了   冷急诊室。我没有保护你。“

无法处理:

  “失去意识在我的怀里,你的呼吸和心跳变成了   越来越弱。最后他们把你赶出了寒冷的紧急情况   房间。我没有保护你,“她的男朋友发布了令人心碎的信息   杜死于窒息之后在线留言。

为了确保,我的python / nltk版本是:

$ python -c "import nltk; print nltk.__version__"
'3.0.3'
$ python -V
Python 2.7.6

除了文本处理的计算方面,问题中的文本语法有些微妙的不同。

引号后跟分号:的事实与传统的英语语法不同。这可能已经在中国新闻中普及,因为在中文中:

  

啊杜窒息死亡后,男友在网上发了令人心碎的消息:“......”

在语言意义上的传统英语中,它应该是:

  在杜死于窒息之后,她的男朋友发布了令人心碎的事   在线留言,“......”

后引号声明会以结尾逗号而非完整停止来表示,例如:

  

“......,”她的男朋友在杜后发布了一条令人心碎的消息   死于窒息。