How to incorporate prior information in pomegranate? In other words: does pomegranate support incremental learning?

Asked: 2019-12-11 20:09:33

Tags: python pomegranate

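Suppose I use pomegranate to fit a model to the data available at the time. Once more data comes in, I would like to update the model accordingly. In other words, can pomegranate update an existing model with new data without overriding the previous parameters? Just to be clear: I am not referring to out-of-core learning, because my problem is that data becomes available at different points in time, not that there is too much data to fit in memory at a single point in time.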

Here is what I have tried:

>>> from pomegranate.distributions import BetaDistribution

>>> # suppose a coin generated the following data, where 1 is head and 0 is tail
>>> data1 = [0, 0, 0, 1, 0, 1, 0, 1, 0, 0]

>>> # as usual, we fit a Beta distribution to infer the bias of the coin
>>> model = BetaDistribution(1, 1)
>>> model.summarize(data1)  # compute sufficient statistics

>>> # presume we have seen all the data available so far,
>>> # we can now estimate the parameters
>>> model.from_summaries()

>>> # this results in the following model (so far so good)
>>> model
{
    "class" :"Distribution",
    "name" :"BetaDistribution",
    "parameters" :[
        3.0,
        7.0
    ],
    "frozen" :false
}

>>> # now suppose the coin is flipped a few more times, getting the following data
>>> data2 = [0, 1, 0, 0, 1]

>>> # we would like to update the model parameters accordingly
>>> model.summarize(data2)

>>> # but this fits only data2, overriding the previous parameters
>>> model.from_summaries()
>>> model
{
    "class" :"Distribution",
    "name" :"BetaDistribution",
    "parameters" :[
        2.0,
        3.0
    ],
    "frozen" :false
}


>>> # however I want to get the result that corresponds to the following,
>>> # but ideally without having to "drag along" data1
>>> data3 = data1 + data2
>>> model.fit(data3)
>>> model  # this should be the final model
{
    "class" :"Distribution",
    "name" :"BetaDistribution",
    "parameters" :[
        5.0,
        10.0
    ],
    "frozen" :false
}

Edit

Another way to phrase the question: does pomegranate support incremental (online) learning? Basically, I am looking for something similar to scikit-learn's partial_fit().

Given that pomegranate supports out-of-core learning, I have the feeling that I am overlooking something. Any help?

1 answer:

Answer 0 (score: 0)

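The actual problem is from_summaries. For the Beta distribution, it does: self.summaries = [0, 0]. All from_summaries methods are destructive: they turn the accumulated summaries into the distribution's parameters and then reset the summaries. Summaries can always be updated with additional observations, whereas parameters cannot.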

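I think this is a bad design. It would be better to keep the summaries as accumulators of observations and treat the parameters as derived, cached values.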
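To illustrate the accumulator idea, here is a minimal sketch in plain Python. It is not pomegranate code: the class CountingBeta and its members are hypothetical names made up for this example, and it assumes (as the output in the question suggests) that the parameters are simply the counts of 1s and 0s.

```python
class CountingBeta:
    """Hypothetical illustration of the accumulator design; not part of pomegranate."""

    def __init__(self):
        # Running sufficient statistics: number of heads (1s) and tails (0s).
        self.heads = 0
        self.tails = 0

    def summarize(self, data):
        """Add a batch of 0/1 observations to the running counts."""
        for x in data:
            if x == 1:
                self.heads += 1
            else:
                self.tails += 1

    @property
    def parameters(self):
        # Derived from the counts on demand; reading them never resets the counts.
        return [float(self.heads), float(self.tails)]


model = CountingBeta()
model.summarize([0, 0, 0, 1, 0, 1, 0, 1, 0, 0])  # data1 from the question
first = model.parameters
model.summarize([0, 1, 0, 0, 1])                  # data2 from the question
second = model.parameters
```

Because the counts are never cleared, reading the parameters after the second batch reflects both batches, without having to keep data1 around.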

If you do this:

>>> model = BetaDistribution(1, 1)
>>> model.summarize(data1)
>>> model.summarize(data2)
>>> model.from_summaries()
>>> model

you will find that it yields the same result as using model.fit(data3).