我正在玩Natural Language Toolkit(NLTK)。
NLTK的使用/应用是否有任何好的但基本的例子?我正在考虑 Stream Hacker 博客上的NTLK articles之类的内容。
答案 0 :(得分:28)
这是我自己的实际例子,让其他人看到这个问题的好处(原文示例,这是我在Wikipedia找到的第一件事):
import nltk
import pprint
tokenizer = None
tagger = None
def init_nltk():
global tokenizer
global tagger
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|[^\w\s]+')
tagger = nltk.UnigramTagger(nltk.corpus.brown.tagged_sents())
def tag(text):
global tokenizer
global tagger
if not tokenizer:
init_nltk()
tokenized = tokenizer.tokenize(text)
tagged = tagger.tag(tokenized)
tagged.sort(lambda x,y:cmp(x[1],y[1]))
return tagged
def main():
text = """Mr Blobby is a fictional character who featured on Noel
Edmonds' Saturday night entertainment show Noel's House Party,
which was often a ratings winner in the 1990s. Mr Blobby also
appeared on the Jamie Rose show of 1997. He was designed as an
outrageously over the top parody of a one-dimensional, mute novelty
character, which ironically made him distinctive, absurd and popular.
He was a large pink humanoid, covered with yellow spots, sporting a
permanent toothy grin and jiggling eyes. He communicated by saying
the word "blobby" in an electronically-altered voice, expressing
his moods through tone of voice and repetition.
There was a Mrs. Blobby, seen briefly in the video, and sold as a
doll.
However Mr Blobby actually started out as part of the 'Gotcha'
feature during the show's second series (originally called 'Gotcha
Oscars' until the threat of legal action from the Academy of Motion
Picture Arts and Sciences[citation needed]), in which celebrities
were caught out in a Candid Camera style prank. Celebrities such as
dancer Wayne Sleep and rugby union player Will Carling would be
enticed to take part in a fictitious children's programme based around
their profession. Mr Blobby would clumsily take part in the activity,
knocking over the set, causing mayhem and saying "blobby blobby
blobby", until finally when the prank was revealed, the Blobby
costume would be opened - revealing Noel inside. This was all the more
surprising for the "victim" as during rehearsals Blobby would be
played by an actor wearing only the arms and legs of the costume and
speaking in a normal manner.[citation needed]"""
tagged = tag(text)
l = list(set(tagged))
l.sort(lambda x,y:cmp(x[1],y[1]))
pprint.pprint(l)
if __name__ == '__main__':
main()
输出:
[('rugby', None),
('Oscars', None),
('1990s', None),
('",', None),
('Candid', None),
('"', None),
('blobby', None),
('Edmonds', None),
('Mr', None),
('outrageously', None),
('.[', None),
('toothy', None),
('Celebrities', None),
('Gotcha', None),
(']),', None),
('Jamie', None),
('humanoid', None),
('Blobby', None),
('Carling', None),
('enticed', None),
('programme', None),
('1997', None),
('s', None),
("'", "'"),
('[', '('),
('(', '('),
(']', ')'),
(',', ','),
('.', '.'),
('all', 'ABN'),
('the', 'AT'),
('an', 'AT'),
('a', 'AT'),
('be', 'BE'),
('were', 'BED'),
('was', 'BEDZ'),
('is', 'BEZ'),
('and', 'CC'),
('one', 'CD'),
('until', 'CS'),
('as', 'CS'),
('This', 'DT'),
('There', 'EX'),
('of', 'IN'),
('inside', 'IN'),
('from', 'IN'),
('around', 'IN'),
('with', 'IN'),
('through', 'IN'),
('-', 'IN'),
('on', 'IN'),
('in', 'IN'),
('by', 'IN'),
('during', 'IN'),
('over', 'IN'),
('for', 'IN'),
('distinctive', 'JJ'),
('permanent', 'JJ'),
('mute', 'JJ'),
('popular', 'JJ'),
('such', 'JJ'),
('fictional', 'JJ'),
('yellow', 'JJ'),
('pink', 'JJ'),
('fictitious', 'JJ'),
('normal', 'JJ'),
('dimensional', 'JJ'),
('legal', 'JJ'),
('large', 'JJ'),
('surprising', 'JJ'),
('absurd', 'JJ'),
('Will', 'MD'),
('would', 'MD'),
('style', 'NN'),
('threat', 'NN'),
('novelty', 'NN'),
('union', 'NN'),
('prank', 'NN'),
('winner', 'NN'),
('parody', 'NN'),
('player', 'NN'),
('actor', 'NN'),
('character', 'NN'),
('victim', 'NN'),
('costume', 'NN'),
('action', 'NN'),
('activity', 'NN'),
('dancer', 'NN'),
('grin', 'NN'),
('doll', 'NN'),
('top', 'NN'),
('mayhem', 'NN'),
('citation', 'NN'),
('part', 'NN'),
('repetition', 'NN'),
('manner', 'NN'),
('tone', 'NN'),
('Picture', 'NN'),
('entertainment', 'NN'),
('night', 'NN'),
('series', 'NN'),
('voice', 'NN'),
('Mrs', 'NN'),
('video', 'NN'),
('Motion', 'NN'),
('profession', 'NN'),
('feature', 'NN'),
('word', 'NN'),
('Academy', 'NN-TL'),
('Camera', 'NN-TL'),
('Party', 'NN-TL'),
('House', 'NN-TL'),
('eyes', 'NNS'),
('spots', 'NNS'),
('rehearsals', 'NNS'),
('ratings', 'NNS'),
('arms', 'NNS'),
('celebrities', 'NNS'),
('children', 'NNS'),
('moods', 'NNS'),
('legs', 'NNS'),
('Sciences', 'NNS-TL'),
('Arts', 'NNS-TL'),
('Wayne', 'NP'),
('Rose', 'NP'),
('Noel', 'NP'),
('Saturday', 'NR'),
('second', 'OD'),
('his', 'PP$'),
('their', 'PP$'),
('him', 'PPO'),
('He', 'PPS'),
('more', 'QL'),
('However', 'RB'),
('actually', 'RB'),
('also', 'RB'),
('clumsily', 'RB'),
('originally', 'RB'),
('only', 'RB'),
('often', 'RB'),
('ironically', 'RB'),
('briefly', 'RB'),
('finally', 'RB'),
('electronically', 'RB-HL'),
('out', 'RP'),
('to', 'TO'),
('show', 'VB'),
('Sleep', 'VB'),
('take', 'VB'),
('opened', 'VBD'),
('played', 'VBD'),
('caught', 'VBD'),
('appeared', 'VBD'),
('revealed', 'VBD'),
('started', 'VBD'),
('saying', 'VBG'),
('causing', 'VBG'),
('expressing', 'VBG'),
('knocking', 'VBG'),
('wearing', 'VBG'),
('speaking', 'VBG'),
('sporting', 'VBG'),
('revealing', 'VBG'),
('jiggling', 'VBG'),
('sold', 'VBN'),
('called', 'VBN'),
('made', 'VBN'),
('altered', 'VBN'),
('based', 'VBN'),
('designed', 'VBN'),
('covered', 'VBN'),
('communicated', 'VBN'),
('needed', 'VBN'),
('seen', 'VBN'),
('set', 'VBN'),
('featured', 'VBN'),
('which', 'WDT'),
('who', 'WPS'),
('when', 'WRB')]
答案 1 :(得分:18)
NLP通常非常有用,因此您可能希望将搜索范围扩展到文本分析的一般应用程序。我使用NLTK通过提取概念图生成文件分类来帮助MOSS 2010。它运作得很好。文件开始以有用的方式聚类不需要很长时间。
通常,为了理解文本分析,您必须考虑您习惯于思考的方式。例如,文本分析对于发现非常有用。但是,大多数人甚至不知道搜索和发现之间的区别。如果您阅读了这些主题,您可能会“发现”您希望将NLTK用于工作的方式。
另外,考虑一下没有NLTK的文本文件的世界观。你有一堆由空格和标点符号分隔的随机长度字符串。一些标点符号会改变它的使用方式,例如句点(也是缩写的小数点和后缀标记。)使用NLTK,您可以获得单词以及更多内容,从而获得词性。现在您可以处理内容了。使用NLTK发现文档中的概念和操作。使用NLTK来获取文档的“含义”。在这种情况下的含义是指文档中的基本关系。
对NLTK感到好奇是一件好事。 Text Analytics将在未来几年内大举突破。理解它的人将更适合更好地利用新机会。
答案 2 :(得分:14)
我是streamhacker.com的作者(感谢您的提及,我从这个特定问题获得了大量的点击流量)。具体你想做什么? NLTK有许多工具可用于执行各种操作,但在某些方面缺乏关于使用工具的清晰信息,以及如何最好地使用它们。它也面向学术问题,因此将pedagogical例子转化为实际解决方案可能会很重要。