I am trying to run a Python script that uses NLTK tagging internally, but it fails with an error. Here is the part of the script that initializes NLTK:
import os
from pkg_resources import resource_filename  # assumed origin of resource_filename

class NLTKTagger:
    '''
    class that supplies part of speech tags using NLTK
    note: avoids the NLTK downloader (see __init__ method)
    '''
    def __init__(self):
        import nltk
        from nltk.tag import PerceptronTagger
        from nltk.tokenize import TreebankWordTokenizer
        tokenizer_fn = os.path.abspath(resource_filename('phrasemachine.data', 'punkt.english.pickle'))
        tagger_fn = os.path.abspath(resource_filename('phrasemachine.data', 'averaged_perceptron_tagger.pickle'))
        # Load the tagger
        self.tagger = PerceptronTagger(load=False)
        self.tagger.load(tagger_fn)
        # note: nltk.word_tokenize calls the TreebankWordTokenizer, but uses the downloader.
        # Calling the TreebankWordTokenizer like this allows skipping the downloader.
        # It seems the TreebankWordTokenizer uses PTB tokenization = regexes. i.e. no downloads
        # https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L25
        self.tokenize = TreebankWordTokenizer().tokenize
        self.sent_detector = nltk.data.load(tokenizer_fn)
I am using Python 3.6 with NLTK 3.2.1 on Windows 7. I tried the solutions mentioned here and here, but neither worked. Is there any other solution?
Answer 0 (score: 2)
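The data loader mistakes the C: prefix in your path for a protocol name such as http:. I thought this had already been fixed... To avoid the problem, add the file: protocol to the start of your path. For example: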
self.tagger.load("file://"+tagger_fn)
(There are better ways to structure the code, but that is up to you.)
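To illustrate why a drive letter can be read as a protocol, here is a minimal sketch that uses Python's standard urllib.parse instead of NLTK's own loader, whose internals may differ. The path is a made-up example, not one taken from the question.

```python
from urllib.parse import urlparse

# Hypothetical Windows path, used only for illustration.
windows_path = r"C:\Users\me\data\averaged_perceptron_tagger.pickle"

# A generic URL parser treats the text before the first ':' as the scheme,
# so the drive letter ends up looking like a protocol name.
scheme = urlparse(windows_path).scheme
print(scheme)

# With an explicit file: protocol in front, the scheme is unambiguous.
prefixed = "file://" + windows_path
prefixed_scheme = urlparse(prefixed).scheme
print(prefixed_scheme)
```

The same kind of ambiguity is what the file: prefix works around when the path is handed to the tagger's load call.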
Technically this is not a bug, since nltk.data.load() expects a URL, not a filesystem path. But really it should be fixed; handling Windows paths is not difficult...
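As a rough idea of what handling Windows paths could look like on the caller's side, the sketch below adds the file:// prefix only when the argument starts with a drive letter. The helper name to_loader_url is hypothetical and not part of NLTK; whether a given NLTK version accepts the resulting string should be checked against that version.

```python
import re

# Matches a drive-letter prefix such as "C:\" or "d:/" at the start of a string.
_DRIVE_PREFIX = re.compile(r"^[A-Za-z]:[\\/]")

def to_loader_url(path_or_url):
    """Hypothetical helper: prefix bare Windows paths with file://.

    Strings that do not start with a drive letter (for example real URLs
    or paths that already carry a protocol) are returned unchanged.
    """
    if _DRIVE_PREFIX.match(path_or_url):
        return "file://" + path_or_url
    return path_or_url
```

In the class from the question, this would be used as self.tagger.load(to_loader_url(tagger_fn)), which leaves non-Windows paths untouched.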