Question

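Does anyone know whether there is an existing Python package for training log-linear models? I have a dataset with 2000 variables and 1000 records, and I would like to use a log-linear model to estimate frequencies.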

Answer 1

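If you are using an old version of SciPy (0.10 or earlier), you can use scipy.maxentropy (in NLP, MaxEnt = maximum entropy modeling = log-linear models). The module was removed from SciPy in version 0.11.0, and the SciPy team advised using sklearn.linear_model.LogisticRegression as a replacement. (Note that both log-linear models and logistic regression are examples of generalized linear models, in which a linear predictor is related to the mean of the response through a link function.)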
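
For reference, here is a minimal sketch of what the suggested replacement might look like. The data is synthetic and only mirrors the shape mentioned in the question (1000 records, 2000 variables); the binary target and the choice of C are assumptions for illustration, not something taken from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the data in the question: 1000 records, 2000 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2000))

# Hypothetical binary target driven by the first five variables plus noise.
y = (X[:, :5].sum(axis=1) + rng.normal(size=1000) > 0).astype(int)

# Regularization matters here because there are more variables than records;
# C is the inverse regularization strength (assumed value, tune as needed).
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X, y)

# Estimated class probabilities for each record, one column per class.
probs = clf.predict_proba(X)
print(probs[:3])
```

With more variables than records the model can fit the training data almost perfectly, so any real use would want a held-out set or cross-validation to choose C.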

Example using SciPy's maxentropy module (removed in SciPy 0.11.0):

#!/usr/bin/env python

""" Example use of the maximum entropy module:

    Machine translation example -- English to French -- from the paper 'A
    maximum entropy approach to natural language processing' by Berger et
    al., 1996.

    Consider the translation of the English word 'in' into French.  We
    notice in a corpus of parallel texts the following facts:

        (1)    p(dans) + p(en) + p(a) + p(au cours de) + p(pendant) = 1
        (2)    p(dans) + p(en) = 3/10
        (3)    p(dans) + p(a)  = 1/2

    This code finds the probability distribution with maximal entropy
    subject to these constraints.
"""

__author__ =  'Ed Schofield'
__version__=  '2.1'

from scipy import maxentropy

a_grave = u'\u00e0'

samplespace = ['dans', 'en', a_grave, 'au cours de', 'pendant']

def f0(x):
    return x in samplespace

def f1(x):
    return x=='dans' or x=='en'

def f2(x):
    return x=='dans' or x==a_grave

f = [f0, f1, f2]

model = maxentropy.model(f, samplespace)

# Now set the desired feature expectations
K = [1.0, 0.3, 0.5]

model.verbose = True

# Fit the model
model.fit(K)

# Output the distribution
print "\nFitted model parameters are:\n" + str(model.params)
print "\nFitted distribution is:"
p = model.probdist()
for j in range(len(model.samplespace)):
    x = model.samplespace[j]
    print ("\tx = %-15s" %(x + ":",) + " p(x) = "+str(p[j])).encode('utf-8')


# Now show how well the constraints are satisfied:
print
print "Desired constraints:"
print "\tp['dans'] + p['en'] = 0.3"
print ("\tp['dans'] + p['" + a_grave + "']  = 0.5").encode('utf-8')
print
print "Actual expectations under the fitted model:"
print "\tp['dans'] + p['en'] =", p[0] + p[1]
print ("\tp['dans'] + p['" + a_grave + "']  = " + str(p[0]+p[2])).encode('utf-8')
# (Or substitute "x.encode('latin-1')" if you have a primitive terminal.)

Other ideas: http://homepages.inf.ed.ac.uk/lzhang10/maxent.html
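
Since scipy.maxentropy is no longer available, the small translation example above can also be reproduced with the standard library alone. This is only a sketch under my own assumptions: it fits the weights by plain gradient descent on the dual objective rather than using whatever optimizer the old module used, and the names probdist and fit, the step size, and the iteration count are made up for illustration.

```python
import math

samplespace = ['dans', 'en', '\u00e0', 'au cours de', 'pendant']

# Binary features corresponding to f1 and f2 in the SciPy example.
# (f0 is not needed because the distribution is normalized explicitly.)
features = [
    lambda x: float(x in ('dans', 'en')),
    lambda x: float(x in ('dans', '\u00e0')),
]
targets = [0.3, 0.5]  # desired feature expectations


def probdist(lambdas):
    """Log-linear distribution p(x) proportional to exp(sum_i lambda_i * f_i(x))."""
    scores = [
        math.exp(sum(l * f(x) for l, f in zip(lambdas, features)))
        for x in samplespace
    ]
    z = sum(scores)
    return [s / z for s in scores]


def fit(step=1.0, iterations=2000):
    """Gradient descent on the dual; the gradient is E_p[f_i] - target_i."""
    lambdas = [0.0] * len(features)
    for _ in range(iterations):
        p = probdist(lambdas)
        for i, f in enumerate(features):
            expected = sum(pj * f(x) for pj, x in zip(p, samplespace))
            lambdas[i] -= step * (expected - targets[i])
    return lambdas


lambdas = fit()
p = probdist(lambdas)
for x, px in zip(samplespace, p):
    print(f"{x:<12} {px:.4f}")
print("p(dans) + p(en) =", p[0] + p[1])
print("p(dans) + p(\u00e0)  =", p[0] + p[2])
```

This only scales to tiny sample spaces that can be enumerated, which is fine for the toy example but not for a dataset with 2000 variables.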

Answer 2

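I'm not sure whether this solves your problem, since you mentioned "machine learning" and it is not clear what kind of data you have. But since you also mentioned "prediction" and "estimating frequencies", my guess is that interpolation might help. In that case, you could take a look at scipy.interpolate.

The Rbf interpolator is "a class for radial basis function approximation/interpolation of n-dimensional scattered data...". It supports the following basis functions, where r is the distance between points: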

'multiquadric': sqrt((r/self.epsilon)**2 + 1) 
'inverse':      1.0/sqrt((r/self.epsilon)**2 + 1)
'gaussian':     exp(-(r/self.epsilon)**2)
'linear':       r 
'cubic':        r**3 
'quintic':      r**5
'thin_plate':   r**2 * log(r)
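
A minimal sketch of how Rbf might be used, assuming some 1-D scattered data; the sample points and the choice of 'multiquadric' are hypothetical and only there to show the call pattern.

```python
import numpy as np
from scipy.interpolate import Rbf

# Hypothetical 1-D scattered data: samples of sin(x) at a few points.
x = np.linspace(0.0, 10.0, 9)
d = np.sin(x)

# Build the interpolator; `function` selects one of the basis functions listed above.
rbf = Rbf(x, d, function='multiquadric')

# Evaluate the interpolant on a finer grid.
x_new = np.linspace(0.0, 10.0, 101)
d_new = rbf(x_new)
print(d_new[:5])
```

For n-dimensional data the coordinates are passed as separate arrays, followed by the values, e.g. Rbf(x, y, z, d).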
