Retrieving spans from the sentence tokenizer

Time: 2019-07-10 23:51:18

Tags: python-3.x nltk tokenize

I want to retrieve the spans from the basic nltk sentence tokenizer (I know it is possible with the pst tokenizer, but the basic tokenizer does a better job). Is it possible to run a span_tokenize method on sent_tokenize?

from nltk import sent_tokenize
sentences = sent_tokenize(text)  # text: the string to be split into sentences

2 Answers:

Answer 0 (score: 0)

Assuming you want the spans of the words:

from nltk.tokenize import WhitespaceTokenizer as wt
from nltk import sent_tokenize
sentences = sent_tokenize("This is a sentence. This is another sentence. The sky is blue.")
print(list(wt().span_tokenize_sents(sentences)))

Output:

[[(0, 4), (5, 7), (8, 9), (10, 19)], [(0, 4), (5, 7), (8, 15), (16, 25)], [(0, 3), (4, 7), (8, 10), (11, 16)]]

See https://www.nltk.org/api/nltk.tokenize.html and search for span_tokenize_sents.
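
Note that span_tokenize_sents returns offsets relative to each sentence, not to the original text. If you need word spans in the coordinates of the full string, a minimal sketch (not part of the original answer, assuming punkt-style sentence spans are acceptable) is to shift each word span by its sentence's start offset:

from nltk.tokenize import WhitespaceTokenizer
from nltk.tokenize.punkt import PunktSentenceTokenizer

text = "This is a sentence. This is another sentence. The sky is blue."
wt = WhitespaceTokenizer()
pt = PunktSentenceTokenizer()

absolute_spans = []
for sent_start, sent_end in pt.span_tokenize(text):
    # word spans are relative to the sentence, so shift them by sent_start
    sentence = text[sent_start:sent_end]
    absolute_spans.append(
        [(sent_start + s, sent_start + e) for s, e in wt.span_tokenize(sentence)]
    )
print(absolute_spans)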

Answer 1 (score: 0)

For sentence spans, you can use span_tokenize() from nltk.tokenize.punkt.PunktSentenceTokenizer: https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktSentenceTokenizer

The following code will give you the sentence spans as output:

    from nltk.tokenize.punkt import PunktSentenceTokenizer as pt
    full_text = "This is your text. You will split it into sentences. And get their spans."
    # list of (start, end) character offsets in full_text, one pair per sentence
    spans = list(pt().span_tokenize(full_text))
    print(spans)
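
As a small usage note (not part of the original answer), each (start, end) pair can be sliced back out of full_text to recover the corresponding sentence:

    # each span slices one sentence out of the original string
    print([full_text[start:end] for start, end in spans])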