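I'm a beginner at this, but I want to create a folder containing many texts (say, novels saved as .txt). I then want to let the user pick one of those novels and have the part-of-speech tagger analyze the whole text automatically. Is that possible? I've been trying: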
text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text)
How can I analyze a text chosen by the user instead of a single sentence? And how do I import those texts?
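Answer 0 (score: 2)
There are several ways to read a directory of text files.
Let's first try the native Python way, starting from the terminal/console/command prompt: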
~$ mkdir ~/testcorpora
~$ cd ~/testcorpora/
~/testcorpora$ ls
~/testcorpora$ echo 'this is a foo foo bar bar.\n bar foo, dah dah.' > somefoobar.txt
~/testcorpora$ echo 'what are you talking about?' > talkingabout.txt
~/testcorpora$ ls
somefoobar.txt talkingabout.txt
~/testcorpora$ cd ..
~$ python
>>> import os
>>> from nltk.tokenize import word_tokenize
>>> from nltk.tag import pos_tag
>>> corpus_directory = 'testcorpora/'
>>> for infile in os.listdir(corpus_directory):
...     with open(corpus_directory+infile, 'r') as fin:
...         pos_tag(word_tokenize(fin.read()))
...
[('what', 'WP'), ('are', 'VBP'), ('you', 'PRP'), ('talking', 'VBG'), ('about', 'IN'), ('?', '.')]
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('foo', 'NN'), ('bar', 'NN'), ('bar.\\n', 'NN'), ('bar', 'NN'), ('foo', 'NN'), (',', ','), ('dah', 'NN'), ('dah', 'NN'), ('.', '.')]
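The loop above tags every file in the directory, whereas the question asks for the user to pick one text. A minimal sketch of that selection step follows. The helper names (`list_text_files`, `choose_file`, `tag_file`, `main`) and the `.txt` filter are assumptions made for illustration, not part of NLTK, and the sketch assumes the NLTK tokenizer and tagger models are already installed.

```python
import os


def list_text_files(corpus_directory):
    """Return the sorted names of the .txt files in corpus_directory."""
    return sorted(name for name in os.listdir(corpus_directory)
                  if name.endswith('.txt'))


def choose_file(filenames, choice):
    """Map a 1-based menu choice (given as a string) to a file name.

    Raises ValueError if the choice is not a number or is out of range.
    """
    index = int(choice) - 1
    if not 0 <= index < len(filenames):
        raise ValueError('choice out of range: %r' % choice)
    return filenames[index]


def tag_file(path, encoding='utf-8'):
    """Read one whole text file and return its (word, tag) pairs."""
    # Imported here so the two helpers above can be used without NLTK.
    from nltk.tokenize import word_tokenize
    from nltk.tag import pos_tag
    with open(path, encoding=encoding) as fin:
        return pos_tag(word_tokenize(fin.read()))


def main(corpus_directory='testcorpora/'):
    """Show a numbered menu, ask for a choice, and tag the chosen file."""
    filenames = list_text_files(corpus_directory)
    for number, name in enumerate(filenames, start=1):
        print(number, name)
    chosen = choose_file(filenames, input('Which text? '))
    print(tag_file(os.path.join(corpus_directory, chosen)))
```

Calling `main()` from a script or the interpreter prints the menu and then the tagged tokens of the selected file. For a full novel the tagged list can be large, so it may be more practical to write it to a file than to print it.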
Another solution is to use the PlaintextCorpusReader in NLTK and then run word_tokenize and pos_tag over the corpus; see Creating a new corpus with NLTK:
>>> from nltk.corpus.reader.plaintext import PlaintextCorpusReader
>>> from nltk.tag import pos_tag
>>> corpusdir = 'testcorpora/'
>>> newcorpus = PlaintextCorpusReader(corpusdir,'.*')
>>> dir(newcorpus)
['CorpusView', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_encoding', '_fileids', '_get_root', '_para_block_reader', '_read_para_block', '_read_sent_block', '_read_word_block', '_root', '_sent_tokenizer', '_tag_mapping_function', '_word_tokenizer', 'abspath', 'abspaths', 'encoding', 'fileids', 'open', 'paras', 'raw', 'readme', 'root', 'sents', 'words']
# POS tagging all the words in all text files at the same time.
>>> newcorpus.words()
['this', 'is', 'a', 'foo', 'foo', 'bar', 'bar', '.\\', ...]
>>> pos_tag(newcorpus.words())
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('foo', 'NN'), ('bar', 'NN'), ('bar', 'NN'), ('.\\', ':'), ('n', 'NN'), ('bar', 'NN'), ('foo', 'NN'), (',', ','), ('dah', 'NN'), ('dah', 'NN'), ('.', '.'), ('what', 'WP'), ('are', 'VBP'), ('you', 'PRP'), ('talking', 'VBG'), ('about', 'IN'), ('?', '.')]
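The call above tags the words of all files at once. Since `words()` on a corpus reader also accepts a file id, and `fileids()` lists the available files, a single user-chosen text can be tagged as sketched below. The function name `tag_selected_file` and its optional `tagger` parameter are hypothetical, introduced only for this example.

```python
def tag_selected_file(corpus, fileid, tagger=None):
    """POS-tag the words of one file from a corpus reader.

    corpus is expected to provide fileids() and words(fileid), as
    PlaintextCorpusReader does. tagger defaults to nltk.tag.pos_tag.
    """
    if fileid not in corpus.fileids():
        raise ValueError('unknown file id: %r' % fileid)
    if tagger is None:
        # Imported lazily so a custom tagger can be used without NLTK.
        from nltk.tag import pos_tag
        tagger = pos_tag
    return tagger(list(corpus.words(fileid)))
```

With the reader from the session above, `tag_selected_file(newcorpus, 'talkingabout.txt')` would tag only that file. If the directory may hold other kinds of files, passing a pattern such as `r'.*\.txt'` instead of `'.*'` to PlaintextCorpusReader restricts the corpus to text files.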