Input list of sentences:
sentences = [
"""Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
"""Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
]
Desired output:
How Doth the Little Busy Bee,
I'll try again.
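Is there a way to extract the quotations (which can appear in either single or double quotes) with an nltk built-in or third-party tokenizer?
I've tried using SExprTokenizer, passing the single and double quotes as the parens value, but the result is far from ideal, for example: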
In [1]: from nltk import SExprTokenizer
   ...:
   ...:
   ...: sentences = [
   ...:     """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
   ...:     """Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
   ...: ]
   ...:
   ...: tokenizer = SExprTokenizer(parens='""', strict=False)
   ...: for sentence in sentences:
   ...:     for item in tokenizer.tokenize(sentence):
   ...:         print(item)
   ...:     print("----")
   ...:
Well,
I've
tried
to
say
"
How
Doth
the
Little
Busy
Bee,
"
but it all came different!
----
Alice replied in a very melancholy voice. She continued, 'I'll try again.'
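There are similar threads like this and this, but all of them suggest a regex-based approach. I'm curious whether this can be solved with nltk alone; it sounds like a common task in natural language processing.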
Answer 0 (score: 1)
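Well, under the hood, SExprTokenizer is a regex-based approach too, as can be seen from the source code you linked to.
The source also shows that the authors apparently did not anticipate the opening and closing "paren" being the same character.
The nesting depth is incremented and decremented within the same iteration, so each quotation the tokenizer sees is an empty string: only the quote character itself is emitted as a token, and the text between the quotes is split on whitespace like any other text.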
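The depth-counting behaviour described above can be reproduced with a small standalone sketch. This is a simplified paraphrase written for illustration, not nltk's actual code; the function name `sexpr_like_tokenize` is hypothetical, and the loop is assumed to compare each match against the opening paren and then, independently, against the closing paren.

```python
import re


def sexpr_like_tokenize(text, open_paren, close_paren):
    """Simplified, hypothetical model of a depth-counting paren tokenizer."""
    paren_re = re.compile(
        "|".join(re.escape(p) for p in (open_paren, close_paren))
    )
    result = []
    pos = 0
    depth = 0
    for m in paren_re.finditer(text):
        paren = m.group()
        if depth == 0:
            # Outside any paren: split the text since `pos` on whitespace.
            result += text[pos:m.start()].split()
            pos = m.start()
        if paren == open_paren:
            depth += 1
        if paren == close_paren:
            # If open_paren == close_paren, this branch also runs for the
            # same match, so depth goes 0 -> 1 -> 0 in a single iteration.
            depth = max(0, depth - 1)
            if depth == 0:
                result.append(text[pos:m.end()])
                pos = m.end()
    if pos < len(text):
        result.append(text[pos:])
    return result


# Distinct parens: the parenthesised span is kept together.
print(sexpr_like_tokenize("a (b c) d", "(", ")"))
# Identical "parens": each quote mark closes itself immediately.
print(sexpr_like_tokenize('say "How Doth" now', '"', '"'))
```

With distinct characters the span between the parens survives as one token; with the same character on both sides, every quote mark becomes its own token and the quoted words are split individually, which matches the output shown in the question.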
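I don't think identifying quotations is a very common task in NLP. People use quotation marks in many different ways (especially if you deal with different languages...), so it is hard to do this robustly. For many NLP applications, I'd say quotations are simply ignored...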
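For the two example sentences, a plain regex is probably the most direct route. The following is a minimal sketch only, not an nltk feature: the names `QUOTE_RE` and `extract_quotations` are made up for illustration, and the pattern assumes that a single quote touching a letter on the relevant side is an apostrophe rather than a quotation mark. That heuristic will fail on cases such as plural possessives (`the dogs' bones`) or nested quotes, which is exactly the robustness problem mentioned above.

```python
import re

# Either a double-quoted span, or a single-quoted span whose opening quote is
# not preceded by a word character and whose closing quote is not followed
# by one (so apostrophes inside words like "I've" or "I'll" are skipped).
QUOTE_RE = re.compile(r'''"([^"]*)"|(?<!\w)'(.+?)'(?!\w)''')


def extract_quotations(text):
    """Return the quoted spans found in `text`, without the quote marks."""
    quotes = []
    for m in QUOTE_RE.finditer(text):
        double_quoted, single_quoted = m.groups()
        quotes.append(double_quoted if double_quoted is not None else single_quoted)
    return quotes


sentences = [
    """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
    """Alice replied in a very melancholy voice. She continued, 'I'll try again.'""",
]

for sentence in sentences:
    for quote in extract_quotations(sentence):
        print(quote)
```

On these two inputs the sketch prints the two lines listed as the desired output in the question.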