How does NLTK's classifier transform a large number of features to train a model?

Date: 2017-08-28 00:22:23

Tags: python classification nltk dimensionality-reduction

In the documentation example (Chapter 6, Section 1.3), they select the 2,000 most frequent words as features, with a dataset of 2,000 documents. With all 2,000 words as columns and only 2,000 rows, the curse of dimensionality says the classifier should not be able to find any pattern. But it works, and works really well. For reference, the featuresets in that example are built roughly as in the sketch below.
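(A minimal sketch of that construction, following the book's movie_reviews example; the variable names mirror the book's, and the exact feature-selection detail is my assumption:)

import nltk
from nltk.corpus import movie_reviews

# Vocabulary: the 2,000 most frequent words across the whole corpus.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in all_words.most_common(2000)]

def document_features(document):
    """Map a document (a list of words) to 2,000 boolean features."""
    document_words = set(document)
    return {'contains({})'.format(w): (w in document_words)
            for w in word_features}

# 2,000 labeled rows, each with 2,000 feature columns.
featuresets = [(document_features(movie_reviews.words(fileid)), category)
               for category in movie_reviews.categories()
               for fileid in movie_reviews.fileids(category)]
classifier = nltk.NaiveBayesClassifier.train(featuresets)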

So I checked the source code and found the method below, which makes no sense to me. To me, it looks like they are just mapping the frequency of each word to a probability per label. But that is supposed to be the Naive Bayes algorithm itself. How does this train anything?

# NaiveBayesClassifier.train, from nltk/classify/naivebayes.py:
from collections import defaultdict
from nltk.probability import FreqDist, ELEProbDist

def train(cls, labeled_featuresets, estimator=ELEProbDist):
    """
    :param labeled_featuresets: A list of classified featuresets,
        i.e., a list of tuples ``(featureset, label)``.
    """
    label_freqdist = FreqDist()
    feature_freqdist = defaultdict(FreqDist)
    feature_values = defaultdict(set)
    fnames = set()

    # Count up how many times each feature value occurred, given
    # the label and featurename.
    for featureset, label in labeled_featuresets:
        label_freqdist[label] += 1
        for fname, fval in featureset.items():
            # Increment freq(fval|label, fname)
            feature_freqdist[label, fname][fval] += 1
            # Record that fname can take the value fval.
            feature_values[fname].add(fval)
            # Keep a list of all feature names.
            fnames.add(fname)

    # If a feature didn't have a value given for an instance, then
    # we assume that it gets the implicit value 'None.'  This loop
    # counts up the number of 'missing' feature values for each
    # (label,fname) pair, and increments the count of the fval
    # 'None' by that amount.
    for label in label_freqdist:
        num_samples = label_freqdist[label]
        for fname in fnames:
            count = feature_freqdist[label, fname].N()
            # Only add a None key when necessary, i.e. if there are
            # any samples with feature 'fname' missing.
            if num_samples - count > 0:
                feature_freqdist[label, fname][None] += num_samples - count
                feature_values[fname].add(None)

    # Create the P(label) distribution
    label_probdist = estimator(label_freqdist)

    # Create the P(fval|label, fname) distribution
    feature_probdist = {}
    for ((label, fname), freqdist) in feature_freqdist.items():
        probdist = estimator(freqdist, bins=len(feature_values[fname]))
        feature_probdist[label, fname] = probdist

    return cls(label_probdist, feature_probdist)

1 Answer:

Answer 0 (score: 1):

The Naive Bayes algorithm does not do any feature selection. I don't know what "patterns" you are imagining when you write that it "will not be able to find any pattern due to the curse of dimensionality," but it really does use all of the features supplied to its model.

Label probabilities are estimated by combining the probability estimates contributed by each individual feature, as if the features were statistically independent of one another (that is the "naive" part of the model). This independence assumption is what makes building a Naive Bayes probability model so quick and easy. Features with more predictive power (i.e., more strongly associated with one of the outcome labels) have a greater influence on the computed probability; any feature whose probability is roughly the same under every label has a negligible effect on the estimate.
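Concretely, classification then reduces to picking the label with the highest combined log-probability. A simplified sketch of what NLTK's prob_classify does with the distributions that train builds (this is my paraphrase; it ignores NLTK's handling of unseen features and the final normalization):

def score(label, featureset, label_probdist, feature_probdist):
    """log P(label) + sum of log P(fval | label, fname) over all features."""
    logprob = label_probdist.logprob(label)
    for fname, fval in featureset.items():
        if (label, fname) in feature_probdist:
            logprob += feature_probdist[label, fname].logprob(fval)
    return logprob

def classify(featureset, label_probdist, feature_probdist, labels):
    # The label with the highest joint log-probability wins.
    return max(labels, key=lambda label: score(
        label, featureset, label_probdist, feature_probdist))

So "training" really is just the counting you saw: the frequency counts become the P(fval | label, fname) estimates that this scoring combines.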

Perhaps what you mean is that a word which appears in only one training document will be spuriously associated with that document's label? That is true, but it only causes a problem when that word also appears in the document being classified. The rest of the time, the more widely distributed words will dominate the probability estimates.
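Note also that the default estimator=ELEProbDist (Lidstone smoothing with gamma = 0.5) softens exactly this effect: a feature value never seen with a label still gets a small non-zero probability rather than zeroing out the whole product. A quick illustration with made-up counts:

from nltk.probability import FreqDist, ELEProbDist

# Hypothetical counts: 'contains(aardvark)' is True in 1 of 1,000 'pos'
# documents and in 0 of 1,000 'neg' documents; boolean feature, so bins=2.
fd_pos = FreqDist({True: 1, False: 999})
fd_neg = FreqDist({False: 1000})

p_pos = ELEProbDist(fd_pos, bins=2)
p_neg = ELEProbDist(fd_neg, bins=2)

# ELE: P(v) = (count(v) + 0.5) / (N + 0.5 * bins)
print(p_pos.prob(True))  # (1 + 0.5) / 1001 ≈ 0.0015
print(p_neg.prob(True))  # (0 + 0.5) / 1001 ≈ 0.0005 -- small, but not zero

In this made-up case, a single spurious co-occurrence shifts the likelihood ratio by only 3:1, rather than making it infinite.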