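How can I save a Python NLTK alignment model for later use?

Time: 2015-05-12 15:25:51

Tags: python io nlp nltk machine-translation

In Python, I'm using NLTK's alignment module to create word alignments between parallel texts. Aligning bitexts can be a time-consuming process, especially over a sizeable corpus. It would be nice to do the alignments in a batch one day and use them later on.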

class Message:
    def __init__(self, type=None, length=None, data=None):
        self.type = type
        self.length = length
        self.data = data

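After creating the model, how can I (1) save it to disk and (2) reuse it later?

3 answers:

Answer 0: (score: 7)

The most straightforward answer is to pickle it; see https://wiki.python.org/moin/UsingPickle

However, a trained IBMModel1 object holds a lambda function, so it cannot be pickled with the default pickle / cPickle (see https://github.com/nltk/nltk/blob/develop/nltk/align/ibm1.py#L74 and https://github.com/nltk/nltk/blob/develop/nltk/align/ibm1.py#L104).

So we will use dill instead. First, install dill (see also "Can Python pickle lambda functions?"):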

$ pip install dill
$ python
>>> import dill as pickle

Then:

>>> import dill
>>> import dill as pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('model1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
...
>>> exit()

Using the pickled model:

>>> import dill as pickle
>>> from nltk.corpus import comtrans
>>> bitexts = comtrans.aligned_sents()[:100]
>>> with open('model1.pk', 'rb') as fin:
...     ibm = pickle.load(fin)
... 
>>> aligned_sent = ibm.align(bitexts[0])
>>> aligned_sent.words
['Wiederaufnahme', 'der', 'Sitzungsperiode']
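
The save and load steps above can be wrapped in two small helpers. This is only a minimal sketch: `save_object`, `load_object` and the `serializer` parameter are hypothetical names that do not come from NLTK or dill, and a plain dict stands in for the trained model so that the example runs with the standard library alone.

```python
import os
import pickle
import tempfile


def save_object(obj, path, serializer=pickle):
    """Write obj to path using serializer.dump (any pickle-compatible module)."""
    with open(path, "wb") as fout:
        serializer.dump(obj, fout)


def load_object(path, serializer=pickle):
    """Read back whatever save_object wrote to path."""
    with open(path, "rb") as fin:
        return serializer.load(fin)


# Demonstration with a plain dict standing in for a trained model.
demo_path = os.path.join(tempfile.mkdtemp(), "model1.pk")
toy_model = {("Haus", "house"): 0.9, ("das", "the"): 0.8}
save_object(toy_model, demo_path)
restored = load_object(demo_path)
```

For an IBMModel1 object, which the standard pickle rejects, one would import dill and pass it in, for example save_object(ibm, 'model1.pk', serializer=dill), since dill exposes the same dump and load functions.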

If you try to pickle the IBMModel1 object, which holds a lambda function, with the standard pickle / cPickle, you end up with this:

>>> import cPickle as pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('model1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle function objects

(Note: the code snippets above are from NLTK version 3.0.0.)
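
The failure can be reproduced without NLTK. The sketch below assumes the model stores its probabilities in a defaultdict whose default factory is a lambda; the `table` variable and the `can_pickle` helper are hypothetical, made up purely for illustration. It also shows a dill-free workaround: converting the table to a plain dict before pickling.

```python
import pickle
from collections import defaultdict


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        # The exception type varies across Python versions.
        return False


# A probability table whose default factory is a lambda (assumed layout).
table = defaultdict(lambda: 0.0)
table[("Haus", "house")] = 0.9

with_lambda = can_pickle(table)          # the lambda blocks pickling

# Workaround: drop the default factory by copying into a plain dict.
plain_table = dict(table)
without_lambda = can_pickle(plain_table)
restored = pickle.loads(pickle.dumps(plain_table))
```

The trade-off is that the plain dict no longer returns a default value for unseen keys, so lookups after reloading would need `restored.get(key, 0.0)` or similar.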

In python3 with NLTK 3.0.0 you will face the same problem, because IBMModel1 holds a lambda function:

alvas@ubi:~$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('mode1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
_pickle.PicklingError: Can't pickle <function IBMModel1.train.<locals>.<lambda> at 0x7fa37cf9d620>: attribute lookup <lambda> on nltk.align.ibm1 failed'

>>> import dill
>>> with open('model1.pk', 'wb') as fout:
...     dill.dump(ibm, fout)
... 
>>> exit()

alvas@ubi:~$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> from nltk.corpus import comtrans
>>> with open('model1.pk', 'rb') as fin:
...     ibm = dill.load(fin)
... 
>>> bitexts = comtrans.aligned_sents()[:100]
>>> aligned_sent = ibm.aligned(bitexts[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'IBMModel1' object has no attribute 'aligned'
>>> aligned_sent = ibm.align(bitexts[0])
>>> aligned_sent.words
['Wiederaufnahme', 'der', 'Sitzungsperiode']

(Note: in python3, pickle is cPickle, i.e. the pickle module uses the C implementation automatically; see http://docs.pythonsprints.com/python3_porting/py-porting.html)
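
Code that has to run on both Python 2 and Python 3 commonly handles this difference with a guarded import. A minimal sketch follows; the `data` dict is made up for the round trip.

```python
try:
    import cPickle as pickle  # Python 2: the fast C implementation
except ImportError:
    import pickle  # Python 3: picks the C accelerator automatically when available

# Round trip a small structure; protocol 2 is readable by both Python 2 and 3.
data = {"words": ["Wiederaufnahme", "der", "Sitzungsperiode"]}
roundtrip = pickle.loads(pickle.dumps(data, protocol=2))
```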

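Answer 1: (score: 3)

You talk about saving the aligner model, but your question seems to be more about saving the bitexts you have already aligned: "It would be nice to do the alignments in a batch one day and use them later on." That is the question I am going to answer.

In the nltk environment, the best way to use a corpus-like resource is to access it with a corpus reader. NLTK doesn't come with corpus writers, but the format supported by NLTK's AlignedCorpusReader is easy to generate (NLTK 3 version):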

model = ibm(biverses, 20)  # As in your question

# Write three lines per sentence pair: words, mots, alignment
with open("folder/newalignedtext.txt", "w") as out:
    for pair in biverses:
        asent = model.align(pair)
        out.write(" ".join(asent.words) + "\n")
        out.write(" ".join(asent.mots) + "\n")
        out.write(str(asent.alignment) + "\n")

That's it. You can later reload and use your aligned sentences just as you would use the comtrans corpus:

from nltk.corpus.reader import AlignedCorpusReader

mycorpus = AlignedCorpusReader(r"folder", r".*\.txt")
biverses_reloaded = mycorpus.aligned_sents()

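As you can see, you don't need the aligner object itself. The aligned sentences can be loaded with a corpus reader, and the aligner itself is of little use unless you want to study the embedded probabilities.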
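
To make the three-line format concrete, here is a sketch that writes and re-reads it without NLTK. It assumes each sentence pair occupies exactly three lines (source tokens, target tokens, then space-separated `i-j` index pairs); `write_aligned` and `read_aligned` are hypothetical helpers, not part of NLTK.

```python
import io


def write_aligned(stream, triples):
    """Write (words, mots, alignment) triples, three lines per sentence pair."""
    for words, mots, alignment in triples:
        stream.write(" ".join(words) + "\n")
        stream.write(" ".join(mots) + "\n")
        stream.write(" ".join("%d-%d" % pair for pair in sorted(alignment)) + "\n")


def read_aligned(stream):
    """Parse the three-line blocks back into (words, mots, alignment) triples."""
    lines = [line.rstrip("\n") for line in stream]
    triples = []
    for i in range(0, len(lines) - 2, 3):
        words = lines[i].split()
        mots = lines[i + 1].split()
        alignment = [tuple(int(n) for n in pair.split("-"))
                     for pair in lines[i + 2].split()]
        triples.append((words, mots, alignment))
    return triples


# One toy sentence pair with a one-to-one alignment.
sample = [(["das", "Haus"], ["the", "house"], [(0, 0), (1, 1)])]
buffer = io.StringIO()
write_aligned(buffer, sample)
buffer.seek(0)
reloaded = read_aligned(buffer)
```

Because the file is plain text, it stays readable across Python and NLTK versions, which is an advantage over a pickled object.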

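Comment: I'm not sure I would call the aligner object a "model". In NLTK 2, the aligner is not set up to align new text; it doesn't even have an align() method. In NLTK 3, align() can align new text, but only under Python 2; in Python 3 it is broken, apparently because of the stricter rules for comparing objects of different types. If you nevertheless want to be able to pickle and reload the aligner, I'll be happy to add that to my answer; from what I've seen, it can be done with vanilla cPickle.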

Answer 2: (score: 1)

If you want, and it looks like you do, you can store it as a list of AlignedSent objects:

from nltk.align import IBMModel1 as ibm
from nltk.align import AlignedSent
import dill as pickle

biverses = [list of AlignedSent objects]  # placeholder: supply your own bitext here
model = ibm(biverses, 20)

# Store each computed alignment back on its sentence pair
for sent in biverses:
    sent.alignment = model.align(sent).alignment

After that, you can save it with dill (imported as pickle):

with open('alignedtext.pk', 'wb') as arquive:
     pickle.dump(biverses, arquive)
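
An alternative worth considering is to reduce each sentence pair to built-in types before saving, so that the file can be reloaded with the standard pickle and without NLTK. The sketch below is an assumption-laden illustration: `FakeAlignedSent` is a hypothetical stand-in exposing only the `words`, `mots` and `alignment` attributes used in this thread, and `to_records` is a made-up helper.

```python
import pickle
from collections import namedtuple

# Hypothetical stand-in for AlignedSent with the three attributes used above.
FakeAlignedSent = namedtuple("FakeAlignedSent", ["words", "mots", "alignment"])


def to_records(aligned_sents):
    """Reduce each sentence pair to plain lists and tuples."""
    return [
        (list(s.words), list(s.mots), sorted(tuple(pair) for pair in s.alignment))
        for s in aligned_sents
    ]


biverses = [FakeAlignedSent(["das", "Haus"], ["the", "house"], {(0, 0), (1, 1)})]
records = to_records(biverses)

# Only built-in types remain, so the standard pickle suffices.
payload = pickle.dumps(records)
reloaded = pickle.loads(payload)
```

Reloading the list saved by the answer's own code would mirror the dump call: open 'alignedtext.pk' in 'rb' mode and call pickle.load on the file object.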