Question

I'm trying to train a naive Bayes classifier and I've run into a problem with my data. I plan to use it for extractive text summarization.

Example_Input: It was a sunny day. The weather was nice and the birds were singing.
Example_Output: The weather was nice and the birds were singing.

I have a dataset I plan to use, in which every document has at least one sentence marked for the summary.

I decided to use sklearn, but I don't know how to represent the data I have, i.e. X and y.

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X, y)

The closest I've come is this:

X = [
        'It was a sunny day. The weather was nice and the birds were singing.',
        'I like trains. Hi, again.'
    ]

y = [
        [0,1],
        [1,0]
    ]

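where each target value means 1 = included in the summary, 0 = not included. Unfortunately this raises a bad shape exception, because y is expected to be a 1-d array. I can't think of a way to represent it, so please help.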

By the way, I don't use the string values in X directly; I represent them as vectors using sklearn's CountVectorizer and TfidfTransformer.

Answer 1

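From your description, this is a classification problem. That means you need to split the documents into individual sentences so that a class can be predicted for each sentence.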

For example, instead of using:

X = [
        'It was a sunny day. The weather was nice and the birds were singing.',
        'I like trains. Hi, again.'
    ]

use this:

X = [
        'It was a sunny day.',
        'The weather was nice and the birds were singing.',
        'I like trains.',
        'Hi, again.'
    ]

You can use NLTK's sentence tokenizer to do the splitting.
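The splitting step could be sketched roughly like this. The helper names `split_sentences` and `flatten_documents` are made up for illustration, and the regex fallback is my own assumption for environments where NLTK or its punkt data is not installed; it is much cruder than NLTK's tokenizer. The nested labels from the question are flattened in the same pass, so y ends up 1-d.

```python
import re


def split_sentences(text):
    """Split text into sentences, preferring NLTK's tokenizer if it is usable."""
    try:
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(text)
    except (ImportError, LookupError):
        # Assumption: naive fallback that splits after ., ! or ? followed by
        # whitespace. It will mishandle abbreviations such as "e.g. this".
        return re.split(r'(?<=[.!?])\s+', text.strip())


def flatten_documents(documents, nested_labels):
    """Turn per-document sentence labels into flat, aligned X and y lists."""
    X, y = [], []
    for text, labels in zip(documents, nested_labels):
        sentences = split_sentences(text)
        if len(sentences) != len(labels):
            raise ValueError('sentence count and label count differ: %r' % text)
        X.extend(sentences)
        y.extend(labels)
    return X, y


documents = [
    'It was a sunny day. The weather was nice and the birds were singing.',
    'I like trains. Hi, again.',
]
nested_labels = [[0, 1], [1, 0]]  # the y from the question

X, y = flatten_documents(documents, nested_labels)
```

The length check matters because a tokenizer may split a document into a different number of sentences than the labels assume, which would silently misalign X and y.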

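Now, for the labels, use two classes. Say 1 means yes (include in the summary) and 0 means no. Keep y as a flat 1-d list with one label per sentence.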

y = [
        0,  # 'It was a sunny day.'
        1,  # 'The weather was nice and the birds were singing.'
        1,  # 'I like trains.'
        0   # 'Hi, again.'
    ]

Now use this data to fit and predict however you like!
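As a rough sketch of what fitting and predicting might look like, assuming the CountVectorizer and TfidfTransformer setup mentioned in the question, the three steps can be chained with sklearn's `make_pipeline`. The sentences in `new_sentences` are invented for illustration, and four training sentences are far too few for meaningful predictions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# One sentence per sample, one label per sentence (1 = keep in summary).
X = [
    'It was a sunny day.',
    'The weather was nice and the birds were singing.',
    'I like trains.',
    'Hi, again.',
]
y = [0, 1, 1, 0]

# Raw strings go in; the pipeline handles counting, tf-idf weighting and
# the naive Bayes fit in order.
model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
model.fit(X, y)

# Hypothetical unseen document, already split into sentences.
new_sentences = ['The birds were singing today.', 'Hi there.']
predictions = model.predict(new_sentences)

# Keep only the sentences predicted as class 1 to form the summary.
summary = ' '.join(
    sentence for sentence, keep in zip(new_sentences, predictions) if keep == 1
)
```

Using a pipeline means the same vocabulary and tf-idf weights learned during `fit` are reused at prediction time, so new sentences are vectorized consistently with the training data.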

Hope it helps!
