Extracting quotations/citations with nltk (not regex)

Asked: 2016-09-26 22:39:34

Tags: python nltk tokenize

The input list of sentences:

sentences = [
    """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
    """Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
]

Desired output:

How Doth the Little Busy Bee,
I'll try again.

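Is there a way to extract the quotations (which can appear in both single and double quotes) with an nltk built-in or third-party tokenizer?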

I've tried the SExprTokenizer, passing the single and double quotes as the parens value, but the result was far from what I wanted. For example:

In [1]: from nltk import SExprTokenizer
    ...: 
    ...: 
    ...: sentences = [
    ...:     """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
    ...:     """Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
    ...: ]
    ...: 
    ...: tokenizer = SExprTokenizer(parens='""', strict=False)
    ...: for sentence in sentences:
    ...:     for item in tokenizer.tokenize(sentence):
    ...:         print(item)
    ...:     print("----")
    ...:     
Well,
I've
tried
to
say
"
How
Doth
the
Little
Busy
Bee,
"
 but it all came different!
----
Alice replied in a very melancholy voice. She continued, 'I'll try again.'

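There are similar threads, like this and this, but all of them suggest a regex-based approach. I'm curious whether this can be solved with nltk alone - it sounds like a common task in Natural Language Processing.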

1 Answer:

Answer 0: (score: 1)

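Well, under the hood, SExprTokenizer is a regex-based approach as well, as you can see from the source code you linked to.
The source also shows that the authors apparently didn't consider the case where the opening and closing "paren" are the same character. The nesting depth is increased and decreased in the same iteration, so the quoted spans the tokenizer sees are empty: each quote character opens and immediately closes a group, and the text between the quotes is split like any other text.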
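
To make the depth problem concrete, here is a minimal sketch of that kind of loop. It is a simplified model written for illustration, not nltk's actual code: the function name `tokenize_like_sexpr` is hypothetical, and it scans characters directly rather than using a compiled pattern.

```python
def tokenize_like_sexpr(text, open_paren, close_paren):
    """Simplified model of a paren-depth tokenizer loop.

    Hypothetical helper for illustration only; not nltk's implementation.
    """
    result = []
    pos = 0      # start of the text not yet emitted
    depth = 0    # current nesting depth
    for i, ch in enumerate(text):
        if ch != open_paren and ch != close_paren:
            continue
        if depth == 0:
            # Outside any group: emit the preceding text word by word.
            result += text[pos:i].split()
            pos = i
        if ch == open_paren:
            depth += 1
        if ch == close_paren:
            # When open_paren == close_paren, this branch also runs for the
            # same character, so depth drops back to 0 immediately.
            depth = max(0, depth - 1)
            if depth == 0:
                result.append(text[pos:i + 1])
                pos = i + 1
    if pos < len(text):
        # Whatever follows the last paren is appended unsplit.
        result.append(text[pos:])
    return result


# Distinct delimiters: the group survives as one token.
print(tokenize_like_sexpr('say (How Doth) now', '(', ')'))
# Identical delimiters: each quote becomes its own one-character group.
print(tokenize_like_sexpr('say "How Doth" now', '"', '"'))
```

With distinct delimiters the group comes back as a single token; with the same character on both sides, every quote is emitted by itself and the quoted words are split individually, which matches the output in the question.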

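Identifying quotes is not that common in NLP, I think. People use quotation marks in many different ways (especially if you deal with different languages...), so it's quite hard to get this right in a robust way. For many NLP applications, I'd say quoting is simply ignored...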