I want to retrieve the spans from the basic NLTK sentence tokenizer (I know using the pst tokenizer would work, but the basic tokenizer does a better job). Is it possible to run the span_tokenize method on sent_tokenize?
from nltk import sent_tokenize
sentences = sent_tokenize(text)
Answer 0 (score: 0)
Assuming you want the spans of the words:
from nltk.tokenize import WhitespaceTokenizer as wt
from nltk import sent_tokenize
sentences = sent_tokenize("This is a sentence. This is another sentence. The sky is blue.")
print(list(wt().span_tokenize_sents(sentences)))
Output:
[[(0, 4), (5, 7), (8, 9), (10, 19)], [(0, 4), (5, 7), (8, 15), (16, 25)], [(0, 3), (4, 7), (8, 10), (11, 16)]]
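Note that these word spans are relative to each sentence, not to the original string. As a sketch (not part of the original answer), they can be shifted into absolute offsets by combining them with the sentence spans from Punkt's `span_tokenize`; the variable names here are illustrative.

```python
# Combine Punkt sentence spans with whitespace word spans to get
# word offsets into the full text rather than into each sentence.
import nltk
from nltk.tokenize import WhitespaceTokenizer
from nltk.tokenize.punkt import PunktSentenceTokenizer

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

text = "This is a sentence. This is another sentence. The sky is blue."
wt = WhitespaceTokenizer()
st = PunktSentenceTokenizer()

absolute_spans = []
for sent_start, sent_end in st.span_tokenize(text):
    for word_start, word_end in wt.span_tokenize(text[sent_start:sent_end]):
        # shift each sentence-relative span by the sentence's start offset
        absolute_spans.append((sent_start + word_start, sent_start + word_end))

print(absolute_spans)
```

Slicing `text` with any of these tuples now yields the corresponding word.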
See https://www.nltk.org/api/nltk.tokenize.html and search for span_tokenize_sents.
Answer 1 (score: 0)
For sentence spans, you can use span_tokenize() from nltk.tokenize.punkt.PunktSentenceTokenizer: https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktSentenceTokenizer. The following code will give you the output:
from nltk.tokenize.punkt import PunktSentenceTokenizer as pt
full_text = "This is your text. You will split it into sentences. And get their spans."
spans = list(pt().span_tokenize(full_text))
print(spans)
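As a quick sanity check (an addition, not from the answer), the returned spans can be used to slice the original string back into its sentences. This assumes the punkt model has been downloaded via `nltk.download`.

```python
# Slice full_text with the (start, end) tuples to recover each sentence
import nltk
from nltk.tokenize.punkt import PunktSentenceTokenizer as pt

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

full_text = "This is your text. You will split it into sentences. And get their spans."
spans = list(pt().span_tokenize(full_text))
for start, end in spans:
    print((start, end), full_text[start:end])
```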