NLTK 'unknown url' error

Time: 2017-02-21 14:46:52

Tags: python nltk

I am trying to run a Python script that uses NLTK tagging internally. Below is the part of the script that initializes NLTK.

I get an 'unknown url' error when the following class is initialized:

import os
# Assumption: resource_filename is the pkg_resources helper; the original excerpt does not show its import.
from pkg_resources import resource_filename


class NLTKTagger:
    '''
    class that supplies part of speech tags using NLTK
    note: avoids the NLTK downloader (see __init__ method)
    '''
    def __init__(self):
        import nltk
        from nltk.tag import PerceptronTagger
        from nltk.tokenize import TreebankWordTokenizer
        # Absolute filesystem paths to the pickles bundled with the package
        tokenizer_fn = os.path.abspath(resource_filename('phrasemachine.data', 'punkt.english.pickle'))
        tagger_fn = os.path.abspath(resource_filename('phrasemachine.data', 'averaged_perceptron_tagger.pickle'))
        # Load the tagger
        self.tagger = PerceptronTagger(load=False)
        self.tagger.load(tagger_fn)

        # note: nltk.word_tokenize calls the TreebankWordTokenizer, but uses the downloader.
        #       Calling the TreebankWordTokenizer like this allows skipping the downloader.
        #       It seems the TreebankWordTokenizer uses PTB tokenization = regexes. i.e. no downloads
        #       https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L25
        self.tokenize = TreebankWordTokenizer().tokenize
        self.sent_detector = nltk.data.load(tokenizer_fn)

I am using Python 3.6 with NLTK 3.2.1 on Windows 7. I have tried the solutions mentioned here and here, but neither worked. Is there any other solution?

1 Answer:

Answer 0 (score: 2)

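The data loader mistakes the C: prefix of the path for a protocol name such as http:. I thought this had already been fixed... To avoid the problem, add the "file:" protocol to the beginning of the path. For example,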

self.tagger.load("file://"+tagger_fn)

(There are better ways to structure the code, but that's up to you.)

Technically this is not a bug, since nltk.data.load() expects a URL, not a filesystem path. But really it should be fixed; handling Windows paths is not difficult...
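
To illustrate why a drive letter can be mistaken for a protocol, here is a small sketch using only the standard library. It is an analogy, not NLTK's own code: NLTK's resource parsing may differ in detail, and the helper names and the example path are hypothetical. The second helper simply prepends "file://" as suggested above.

```python
from urllib.parse import urlparse


def looks_like_url_scheme(path):
    """Return the scheme a generic URL parser sees in `path` ('' if none)."""
    return urlparse(path).scheme


def with_file_protocol(path):
    """Prepend 'file://' unless the string already starts with 'file:'."""
    if path.startswith("file:"):
        return path
    return "file://" + path


# Hypothetical Windows path, used only for illustration
windows_path = r"C:\data\averaged_perceptron_tagger.pickle"

# The drive letter is parsed as if it were a protocol name
print(looks_like_url_scheme(windows_path))
# A real URL, for comparison
print(looks_like_url_scheme("http://example.com/x"))
# With the explicit prefix the scheme is unambiguous
print(looks_like_url_scheme(with_file_protocol(windows_path)))
```

A generic parser has no way to tell "C:" apart from "http:" because both are a run of letters followed by a colon, which is why the explicit prefix removes the ambiguity.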