Question

From the site http://nltk.googlecode.com/svn/trunk/doc/book/ch05.html I have learned how to split tagged words out of a tagged corpus.

The code from the site:

>>> sent = '''
... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
... Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
... said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
... interest/NN of/IN both/ABX governments/NNS ''/'' ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
  [('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'),
  ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]

Here I get a list of tagged words. What I want is a list containing only the words. For example:

  [('The'), ('grand'), ('jury')...

instead of

  ('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN')...

Any suggestions on how I can get this?

Thanks in advance.

Answer 1

I'm not an NLTK expert, but you can simply select the first element of each tuple:

[nltk.tag.str2tuple(t)[0] for t in sent.split()]

This gives you a list of all the words:

['The', 'grand', 'jury'...
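
If the list of tagged tuples already exists (for instance, the result of the list comprehension in the question), the words can also be pulled out of it directly, without calling str2tuple a second time. A minimal sketch, with the first few tuples from the question hardcoded as sample data so it runs without NLTK; the names `tagged` and `words` are illustrative only:

```python
# First few (word, tag) pairs from the question's output, hardcoded here
# as sample data so the example runs without NLTK installed.
tagged = [('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD')]

# Unpack each pair and keep only the word; the tag is ignored.
words = [word for word, _tag in tagged]

print(words)  # ['The', 'grand', 'jury', 'commented']
```

Unpacking in the comprehension does the same job as indexing with `[0]`; it just names the two parts of each pair.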

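What you're asking for is a little confusing, because in your example output each element is wrapped in a 1-tuple, and I don't really see the point of that.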

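Edit: as larsman points out, though, ('The',) is a 1-tuple, whereas ('The') == 'The'. So the example output in the question is really just a plain list of strings.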
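
The difference between a parenthesized string and a 1-tuple can be checked in plain Python. This is only a small illustration; the variable names `one_tuple` and `just_str` are made up for the example:

```python
one_tuple = ('The',)  # the trailing comma is what makes this a tuple of length 1
just_str = ('The')    # parentheses alone only group; this is a plain str

print(type(one_tuple).__name__)  # tuple
print(type(just_str).__name__)   # str
print(just_str == 'The')         # True
print(one_tuple == 'The')        # False
```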
