Using TreeTagger with Python: unable to find the TreeTagger bin

Asked: 2017-04-10 08:39:50

Tags: python nlp nltk treetagger

I am trying to use TreeTagger from Python. I followed these steps to install it: treetagger-python miotto

TreeTagger works fine when I run it from the command prompt, but when I try to launch it from Python, this is what I get:

Traceback (most recent call last):
  File "C:/Users/Marine/PycharmProjects/treetag/treetagtest.py", line 4, in <module>
NLTK was unable to find the TreeTagger bin!
    pprint(tt_fr.tag(u'Mon Dieu, faites que ça marche!'))
  File "C:\Users\Marine\Anaconda3\lib\site-packages\treetagger.py", line 117, in tag
    p = Popen([self._treetagger_bin],
AttributeError: 'TreeTagger' object has no attribute '_treetagger_bin'

Here is the treetagger.py file:

# -*- coding: utf-8 -*-
# Natural Language Toolkit: Interface to the TreeTagger POS-tagger
#
# Copyright (C) Mirko Otto
# Author: Mirko Otto <dropsy@gmail.com>

"""
A Python module for interfacing with the Treetagger by Helmut Schmid.
"""

import os
from subprocess import Popen, PIPE

from nltk.internals import find_binary, find_file
from nltk.tag.api import TaggerI
from sys import platform as _platform

_treetagger_url = 'http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/'

_treetagger_languages = ['bulgarian', 'dutch', 'english', 'estonian',
                         'finnish', 'french', 'galician', 'german', 'italian',
                         'polish', 'russian', 'slovak', 'slovak2', 'spanish']

class TreeTagger(TaggerI):
    r"""
    A class for pos tagging with TreeTagger. The default encoding used by TreeTagger is utf-8. The input is the paths to:
     - a language trained on training data
     - (optionally) the path to the TreeTagger binary

    This class communicates with the TreeTagger binary via pipes.

    Example:

    .. doctest::
        :options: +SKIP

        >>> from treetagger import TreeTagger
        >>> tt = TreeTagger(language='english')
        >>> tt.tag('What is the airspeed of an unladen swallow ?')
        [['What', 'WP', 'What'],
         ['is', 'VBZ', 'be'],
         ['the', 'DT', 'the'],
         ['airspeed', 'NN', 'airspeed'],
         ['of', 'IN', 'of'],
         ['an', 'DT', 'an'],
         ['unladen', 'JJ', '<unknown>'],
         ['swallow', 'NN', 'swallow'],
         ['?', 'SENT', '?']]

    .. doctest::
        :options: +SKIP

        >>> from treetagger import TreeTagger
        >>> tt = TreeTagger(language='german')
        >>> tt.tag('Das Haus hat einen großen hübschen Garten.')
        [['Das', 'ART', 'die'],
         ['Haus', 'NN', 'Haus'],
         ['hat', 'VAFIN', 'haben'],
         ['einen', 'ART', 'eine'],
         ['großen', 'ADJA', 'groß'],
         ['hübschen', 'ADJA', 'hübsch'],
         ['Garten', 'NN', 'Garten'],
         ['.', '$.', '.']]
    """

    def __init__(self, path_to_home=None, language='german',
                 verbose=False, abbreviation_list=None):
        """
        Initialize the TreeTagger.

        :param path_to_home: The TreeTagger binary.
        :param language: Default language is german.

        The encoding used by the model. Unicode tokens
        passed to the tag() and batch_tag() methods are converted to
        this charset when they are sent to TreeTagger.
        The default is utf-8.

        This parameter is ignored for str tokens, which are sent as-is.
        The caller must ensure that tokens are encoded in the right charset.
        """
        treetagger_paths = ['.', '/usr/bin', '/usr/local/bin', '/opt/local/bin',
                            '/Applications/bin', '~/bin', '~/Applications/bin',
                            '~/work/tmp/treetagger/cmd', '~/treetagger/cmd', '~/treetagger/bin']
        treetagger_paths = list(map(os.path.expanduser, treetagger_paths))
        self._abbr_list = abbreviation_list

        if language in _treetagger_languages:
            if _platform == "win32":
                treetagger_bin_name = 'tag-' + language
            else:
                treetagger_bin_name = 'tree-tagger-' + language
        else:
            raise LookupError('Language not in language list!')

        try:
            self._treetagger_bin = find_binary(
                treetagger_bin_name, path_to_home,
                env_vars=('TREETAGGER', 'TREETAGGER_HOME'),
                searchpath=treetagger_paths,
                url=_treetagger_url,
                verbose=verbose)
        except LookupError:
            print('NLTK was unable to find the TreeTagger bin!')

    def tag(self, sentences):
        """Tags a single sentence: a list of words.
        The tokens should not contain any newline characters.
        """

        # Write the actual sentences to the temporary input file
        if isinstance(sentences, list):
            _input = '\n'.join((x for x in sentences))
        else:
            _input = sentences

        # Run the tagger and get the output
        if(self._abbr_list is None):
            p = Popen([self._treetagger_bin],
                      shell=False, stdin=PIPE, stdout=PIPE, stderr=PIPE)
        elif(self._abbr_list is not None):
            p = Popen([self._treetagger_bin,"-a",self._abbr_list],
                      shell=False, stdin=PIPE, stdout=PIPE, stderr=PIPE)

        #(stdout, stderr) = p.communicate(bytes(_input, 'UTF-8'))
        (stdout, stderr) = p.communicate(str(_input).encode('utf-8'))

        # Check the return code.
        if p.returncode != 0:
            print(stderr)
            raise OSError('TreeTagger command failed!')

        treetagger_output = stdout.decode('UTF-8')

        # Output the tagged sentences
        tagged_sentences = []
        for tagged_word in treetagger_output.strip().split('\n'):
            tagged_word_split = tagged_word.split('\t')
            tagged_sentences.append(tagged_word_split)

        return tagged_sentences


if __name__ == "__main__":
    import doctest
    doctest.testmod(optionflags=doctest.NORMALIZE_WHITESPACE)
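
The printed message and the AttributeError in the traceback appear to come from the same code path: when the lookup in `__init__` fails, the `except LookupError` branch only prints, so `self._treetagger_bin` is never assigned and `tag()` later fails on it. A minimal sketch of that pattern follows; `Finder`, `found` and `_bin` are hypothetical stand-ins for illustration, not part of treetagger.py.

```python
class Finder:
    """Hypothetical stand-in for the TreeTagger class; not the real module."""

    def __init__(self, found):
        try:
            if not found:
                # Stands in for find_binary() failing to locate the binary.
                raise LookupError('binary not found')
            self._bin = 'tag-french'
        except LookupError:
            # Same shape as the listing: report the problem, then carry on,
            # which leaves the attribute unset.
            print('unable to find the binary')

    def tag(self):
        # Raises AttributeError if __init__ never assigned self._bin.
        return self._bin


broken = Finder(found=False)
try:
    broken.tag()
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

So the AttributeError is a consequence of the failed lookup rather than a separate problem; the thing to fix is why the binary is not found.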

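I guess something is wrong with my configuration, but I cannot figure out what. I am working on Windows, so maybe it has something to do with the format of the paths in the treetagger_paths variable? My bin files are in C:\treetagger\bin, so I added this path to the treetagger_paths variable.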

Thanks!

1 Answer:

Answer 0 (score: 0)

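Where is your code? The problem is almost certainly in the line where you "added this path to the treetagger_paths variable", and you did not include that line in your question. My guess is that you forgot to use a raw string or to escape the backslashes, so your "path" contains a tab character (\t) that does not belong there.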
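
To illustrate the guess, here is a small sketch using the C:\treetagger\bin path from the question. The variable names are hypothetical, and since the actual line was not posted this only shows what would happen if the path had been written as an ordinary string literal.

```python
# Ordinary literal: "\t" becomes a tab and "\b" becomes a backspace.
bad = 'C:\treetagger\bin'

# Three ways to write the path so the backslashes survive.
raw = r'C:\treetagger\bin'        # raw string
escaped = 'C:\\treetagger\\bin'   # escaped backslashes
forward = 'C:/treetagger/bin'     # forward slashes, also accepted on Windows

print(repr(bad))       # shows the control characters hidden in the string
print(repr(raw))
print(raw == escaped)
print(len(bad), len(raw))
```

The ordinary literal is two characters shorter than the raw one because each of `\t` and `\b` collapses into a single control character, so the directory it names does not exist and the lookup fails.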