Question

Input list of sentences:

sentences = [
    """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
    """Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
]

Desired output:

How Doth the Little Busy Bee,
I'll try again.

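Is there a way to extract quotations (which may appear in either single or double quotes) using an nltk built-in or third-party tokenizer?

I tried the SExprTokenizer tokenizer, passing the single and double quotes as the parens value, but the result was far from the desired one, e.g.: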

In [1]: from nltk import SExprTokenizer
    ...: 
    ...: 
    ...: sentences = [
    ...:     """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
    ...:     """Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
    ...: ]
    ...: 
    ...: tokenizer = SExprTokenizer(parens='""', strict=False)
    ...: for sentence in sentences:
    ...:     for item in tokenizer.tokenize(sentence):
    ...:         print(item)
    ...:     print("----")
    ...:     
Well,
I've
tried
to
say
"
How
Doth
the
Little
Busy
Bee,
"
 but it all came different!
----
Alice replied in a very melancholy voice. She continued, 'I'll try again.'

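There are similar threads on this, but all of them suggest a regex-based approach. I'm curious whether this can be solved with nltk alone; it sounds like a common task in natural language processing.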

Answer 1

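Well, under the hood SExprTokenizer is a regex-based approach too, as can be seen from the source you linked to.
The source also shows that the authors apparently did not consider the case where the opening and closing "paren" are the same character. The nesting depth is increased and decreased within the same iteration, so the quotation the tokenizer sees is the empty string.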
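
To make that failure mode concrete, here is a simplified model of a depth-counting paren tokenizer. It is written for illustration only and is not the NLTK source; the function name toy_paren_tokenize is hypothetical, and it assumes the tokenizer checks "is this an opener?" and "is this a closer?" one after the other for every matched character, as described above.

```python
def toy_paren_tokenize(text, open_paren, close_paren):
    """Simplified stand-in for a depth-counting paren tokenizer (not NLTK's code)."""
    result = []
    pos = 0
    depth = 0
    for i, ch in enumerate(text):
        if ch != open_paren and ch != close_paren:
            continue
        if depth == 0:
            # Outside any group: emit the preceding text as whitespace-split words.
            result += text[pos:i].split()
            pos = i
        if ch == open_paren:
            depth += 1
        if ch == close_paren:
            # When open_paren == close_paren, this branch also fires for the
            # character that was just counted as an opener.
            depth = max(0, depth - 1)
            if depth == 0:
                # Group closed: emit everything from the opening paren to here.
                result.append(text[pos:i + 1])
                pos = i + 1
    if pos < len(text):
        result.append(text[pos:])
    return result


print(toy_paren_tokenize('say (a b) now', '(', ')'))
# ['say', '(a b)', ' now']
print(toy_paren_tokenize('say "a b" now', '"', '"'))
# ['say', '"', 'a', 'b', '"', ' now']
```

With distinct characters the group "(a b)" survives as one token. With the same character on both sides, every quote mark opens and closes a group at once, so each mark becomes its own token and the quoted words are split like ordinary text, which matches the output shown in the question.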

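I don't think identifying quotations is that common in NLP. People use quotation marks in many different ways (especially if you deal with different languages...), so it is quite hard to get this right with a robust approach. For many NLP applications, I'd say, quoting is simply ignored...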
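
For what it's worth, a regex-free character scan can handle the two example sentences, though it also shows how fragile such heuristics are. The sketch below is not an nltk feature; extract_quotes is a hypothetical helper, and it assumes that a quote mark opens a span only at the start of a word and closes it only when not followed by a letter or digit, which is how it tells the apostrophes in "I've" and "I'll" apart from real quote marks.

```python
def extract_quotes(sentence):
    """Heuristic, regex-free scan for quoted spans; a sketch, not a robust solution."""
    quotes = []
    opener = None   # quote character that opened the current span, if any
    start = 0       # index just after the opening quote
    for i, ch in enumerate(sentence):
        if ch not in "\"'":
            continue
        prev_ch = sentence[i - 1] if i > 0 else " "
        next_ch = sentence[i + 1] if i + 1 < len(sentence) else " "
        if opener is None:
            # Treat a quote mark as an opener only at the start of a word.
            if not prev_ch.isalnum() and not next_ch.isspace():
                opener = ch
                start = i + 1
        elif ch == opener and not next_ch.isalnum():
            # Same mark, not followed by a letter or digit: treat it as the closer.
            quotes.append(sentence[start:i])
            opener = None
    return quotes


sentences = [
    """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
    """Alice replied in a very melancholy voice. She continued, 'I'll try again.'""",
]

for sentence in sentences:
    for quote in extract_quotes(sentence):
        print(quote)
# How Doth the Little Busy Bee,
# I'll try again.
```

This breaks easily: a plural possessive such as "the dogs' bones" inside single quotes closes the span too early, a leading apostrophe as in "'tis" opens one that was never meant, and typographic or non-English quotation marks are not handled at all.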
