Suppose I have a tagged corpus (such as the Brown corpus) and I want to extract only the words tagged with '/nn'. For example:
Daniel/np termed/vbd ``/`` extremely/rb conservative/jj ''/'' his/pp$ estimate/nn.....
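This is part of the tagged corpus 'brown'. What I want to do is extract words such as estimate (because it is tagged with /nn) and add them to a list. However, most of the examples I have found are about tagging a corpus, and they have left me quite confused. Could anyone help by providing an example or a tutorial on extracting words from a tagged corpus?
Thanks in advance.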
Answer 0 (score: 3)
See: http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html
>>> import nltk
>>> sent = '''
... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
... Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
... said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
... interest/NN of/IN both/ABX governments/NNS ''/'' ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'),
('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]
If you only want the tokens tagged NN, you can do this:
>>> [nltk.tag.str2tuple(t) for t in sent.split() if t.split('/')[1] == 'NN']
[('jury', 'NN'), ('number', 'NN'), ('interest', 'NN')]
Edit:
Here is sent as a single string, without the ... continuation prompts:
sent = """The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT interest/NN of/IN both/ABX governments/NNS ''/'' ./."""
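The question asks for a list of the words themselves rather than (word, tag) tuples, and the Brown sample in the question uses lowercase tags. As a minimal sketch under those assumptions, the same filtering can be done in plain Python without NLTK. The helper name words_with_tag is hypothetical, not part of any library, and the case-insensitive comparison is an assumption made here so that 'nn' and 'NN' both match:

```python
def words_with_tag(tagged_text, tag="NN"):
    """Return the words whose tag equals `tag` exactly, ignoring case."""
    words = []
    for token in tagged_text.split():
        # rpartition splits on the LAST slash, so a word that itself
        # contains '/' is kept intact and only the final part is the tag.
        word, sep, token_tag = token.rpartition("/")
        # Skip tokens with no slash; compare tags case-insensitively.
        if sep and token_tag.upper() == tag.upper():
            words.append(word)
    return words


# Sample taken from the question (lowercase Brown-style tags).
sample = "Daniel/np termed/vbd ``/`` extremely/rb conservative/jj ''/'' his/pp$ estimate/nn"
nouns = words_with_tag(sample, "nn")
print(nouns)  # ['estimate']
```

Because the comparison is an exact match on the whole tag, tokens such as County/NN-tl or topics/NNS are not returned; loosening that (for example with startswith) would be a different design choice.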