I'm currently writing an NLP application in Python and need a fast POS tagging implementation. The tagger has a C++ wrapper interface:

    #include "POSTagger.h"

    extern "C" POSTagger* initTagger(const char* fileName, bool Normalize,
                                     double BeamThreshold, bool SentStartHeuristic,
                                     int MaxBeamSize)
    {
        FILE *file = open_file(fileName, "rb");
        POSTagger* tagger = new POSTagger(file, Normalize, BeamThreshold,
                                          SentStartHeuristic, MaxBeamSize);
        fclose(file);
        return tagger;
    }

    extern "C" void getTags(POSTagger* tagger, char** words, int sentLen,
                            char** tags)
    {
        Sentence sent(words, sentLen);
        tagger->annotate(sent);
        for( size_t i=0; i<sent.token.size(); i++ )
            tags[i] = strdup(tagger->tagmap.name(sent.token[i].tag));
    }

    extern "C" void destroyTagger(POSTagger* tagger) {
        delete tagger;
    }

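I have never written a Python wrapper for C++ before, so I have a couple of questions:
1. Can I store a custom C++ class instance in Python? I have never seen this done. All the examples I have found only return basic data types. (This POS tagger has to be initialized with a language set, which takes some time to load into memory. So it makes sense to initialize it once and keep it around, rather than rewriting the wrapper to create a new tagger for every tagging call and return only a tag string.)
2. If 1 is possible: what is the simplest way to do it?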
Answer 0 (score: 3)
For this purpose I would suggest Cython. Writing C/C++ extension types with it is straightforward.
Unfortunately, I can't guarantee that this code is entirely correct, because I can't test it without the headers you are using.

    # coding: utf-8
    # file: postagger.pyx
    from libc cimport stdlib

    cdef extern from "Python.h":
        # Python 2 C API; on Python 3 the bytes equivalent is PyBytes_AsString.
        char* PyString_AsString(object)

    cdef extern from "yourPOSTagger.c":
        # Note the syntax: to Cython, the C++ POSTagger class is available
        # as cPOSTagger with this construct.
        cppclass cPOSTagger "POSTagger":
            # Cython only needs the POSTagger type itself, not its
            # attributes, so the body is just ``pass``.
            pass
        cPOSTagger* initTagger(char*, bint, double, bint, int)
        void getTags(cPOSTagger*, char**, int, char**)
        void destroyTagger(cPOSTagger*)

    cdef class POSTagger:
        """ Wraps the POSTagger.h ``POSTagger`` class. """
        cdef cPOSTagger* tagger

        def __init__(self, char* fileName, bint Normalize, double BeamTreshold,
                     bint SentStartHeuristic, int MaxBeamSize):
            self.tagger = initTagger( fileName, Normalize, BeamTreshold,
                                      SentStartHeuristic, MaxBeamSize )
            if self.tagger == NULL:
                raise MemoryError()

        # Extension types release C resources in __dealloc__, not __del__.
        def __dealloc__(self):
            if self.tagger != NULL:
                destroyTagger(self.tagger)

        def getTags(self, tuple words, int sentLen):
            cdef int i = 0
            cdef int nTags
            cdef char** _words
            cdef char** tags
            cdef list reval = []
            if not words:
                raise ValueError("'words' must not be None or empty.")
            if sentLen != len(words):
                raise ValueError("'sentLen' must equal len(words).")
            # The C++ side writes one tag per token, so both buffers hold
            # len(words) entries.
            nTags = len(words)
            _words = <char**> stdlib.malloc(sizeof(char*) * nTags)
            tags = <char**> stdlib.malloc(sizeof(char*) * nTags)
            if _words == NULL or tags == NULL:
                stdlib.free(_words)
                stdlib.free(tags)
                raise MemoryError()
            for item in words:
                if not isinstance(item, basestring):
                    stdlib.free(_words)
                    stdlib.free(tags)
                    raise TypeError( "Element in tuple 'words' must be of type "
                                     "``basestring``." )
                # Borrowed pointer into the Python string; 'words' keeps it alive.
                _words[i] = PyString_AsString(item)
                i += 1
            getTags(self.tagger, _words, sentLen, tags)
            stdlib.free(_words)
            for i in range(nTags):
                # Coercing char* to a Python object copies the bytes, so the
                # strdup'ed buffer can be freed right afterwards.
                reval.append(tags[i])
                stdlib.free(tags[i])
            stdlib.free(tags)
            return reval

You need to compile this code with Cython's --cplus flag.
Edit: Corrected the code; Cython no longer reports errors.
Answer 1 (score: 0)
The simplest approach is to create an opaque type that can be passed around in Python but whose contents the client never needs to care about.
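
One way to picture the opaque-type idea is with the standard-library ctypes module: Python keeps the POSTagger* returned by initTagger as a plain void pointer and only ever hands it back to getTags and destroyTagger. This is a rough, untested sketch rather than a finished binding. It assumes the three extern "C" functions from the question are built into a shared library; the names bind_tagger_api and Tagger, the lib argument, and the optional free callback are all made up for illustration.

```python
import ctypes


def bind_tagger_api(lib):
    """Declare the C signatures from the question's extern "C" interface.

    POSTagger* is declared as c_void_p, so Python treats it as an opaque
    handle and never looks inside it. (Hypothetical helper name.)
    """
    lib.initTagger.argtypes = [ctypes.c_char_p, ctypes.c_bool, ctypes.c_double,
                               ctypes.c_bool, ctypes.c_int]
    lib.initTagger.restype = ctypes.c_void_p
    lib.getTags.argtypes = [ctypes.c_void_p, ctypes.POINTER(ctypes.c_char_p),
                            ctypes.c_int, ctypes.POINTER(ctypes.c_void_p)]
    lib.getTags.restype = None
    lib.destroyTagger.argtypes = [ctypes.c_void_p]
    lib.destroyTagger.restype = None
    return lib


class Tagger:
    """Owns one opaque POSTagger* for as long as the Python object is open."""

    def __init__(self, lib, model_path, normalize, beam_threshold,
                 sent_start_heuristic, max_beam_size, free=None):
        self._lib = bind_tagger_api(lib)
        # Assumption: 'free' is the C runtime's free(), used to release the
        # strdup'ed tag strings. If it is None, those strings are leaked.
        self._free = free
        self._handle = self._lib.initTagger(
            model_path.encode("utf-8"), normalize, beam_threshold,
            sent_start_heuristic, max_beam_size)
        if not self._handle:
            raise RuntimeError("initTagger returned NULL")

    def tag(self, words):
        if self._handle is None:
            raise ValueError("tagger is already closed")
        n = len(words)
        # The array keeps the encoded byte strings alive during the call.
        c_words = (ctypes.c_char_p * n)(*[w.encode("utf-8") for w in words])
        c_tags = (ctypes.c_void_p * n)()
        self._lib.getTags(self._handle, c_words, n, c_tags)
        result = []
        for address in c_tags:
            if address is None:
                raise RuntimeError("getTags left a tag slot unset")
            # string_at copies the NUL-terminated C string into Python bytes.
            result.append(ctypes.string_at(address).decode("utf-8"))
            if self._free is not None:
                self._free(address)
        return result

    def close(self):
        # Safe to call more than once; the handle is destroyed only once.
        if self._handle is not None:
            self._lib.destroyTagger(self._handle)
            self._handle = None
```

With a real build you would load the library once, for example lib = ctypes.CDLL("./libpostagger.so") (the file name is a guess), create a single Tagger, and reuse it for every sentence so the model is loaded only once. The tag strings come from strdup, so they should be released with free from the same C runtime, with its argtypes set to [ctypes.c_void_p] so 64-bit pointers are not truncated.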