I'm trying to train a naive Bayes classifier and I've run into a problem with my data. I plan to use it for extractive text summarization.
Example_Input: It was a sunny day. The weather was nice and the birds were singing.
Example_Output: The weather was nice and the birds were singing.
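I have a dataset I intend to use, in which at least one sentence of each document is marked as belonging to the summary.
I decided to use sklearn, but I don't know how to represent the data I have, namely X and y.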
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X, y)
The closest I've come up with is this:
X = [
'It was a sunny day. The weather was nice and the birds were singing.',
'I like trains. Hi, again.'
]
y = [
[0,1],
[1,0]
]
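Here the target values mean: 1 = included in the summary, 0 = not included. Unfortunately, this raises a bad-shape exception, because y is expected to be a 1-d array. I can't think of a way to represent it, so please help.
By the way, I don't use the string values in X directly; I represent them as vectors using sklearn's CountVectorizer and TfidfTransformer.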
Answer 0 (score: 1)
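From what you describe, this is a classification problem, and the thing being classified is a sentence, not a document. That means you need to split each document into sentences and predict a class for each sentence separately.
For example, instead of using: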
X = [
'It was a sunny day. The weather was nice and the birds were singing.',
'I like trains. Hi, again.'
]
use this:
X = [
'It was a sunny day.',
'The weather was nice and the birds were singing.',
'I like trains.',
'Hi, again.'
]
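You can use NLTK's sentence tokenizer to do the splitting.
Now, for the labels, use two classes. Say 1 means yes (the sentence is in the summary) and 0 means no.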
# One label per sentence, as a flat 1-d list (the shape sklearn expects for y)
y = [0, 1, 1, 0]
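If your dataset is stored per document, as in the question, the conversion to per-sentence X and y could look roughly like the sketch below. The helper names `split_sentences` and `flatten` are made up for illustration, and the regex splitter is only a naive, dependency-free stand-in for NLTK's sentence tokenizer (it will mis-split abbreviations such as "Dr."). It also assumes each document has exactly one label per sentence.

```python
import re


def split_sentences(text):
    # Naive stand-in for NLTK's sentence tokenizer (an assumption for this
    # sketch): split on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def flatten(documents, doc_labels):
    # Turn per-document texts and nested labels into flat, aligned lists.
    X, y = [], []
    for text, labels in zip(documents, doc_labels):
        sentences = split_sentences(text)
        if len(sentences) != len(labels):
            raise ValueError("expected one label per sentence: %r" % text)
        X.extend(sentences)
        y.extend(labels)
    return X, y


documents = [
    "It was a sunny day. The weather was nice and the birds were singing.",
    "I like trains. Hi, again.",
]
doc_labels = [[0, 1], [1, 0]]

X, y = flatten(documents, doc_labels)
```

After this, X holds the four sentences in order and y is the flat list of their labels, so the two line up index by index.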
Now use this data to fit and predict however you like!
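As a rough sketch of that last step, assuming you want to keep the CountVectorizer and TfidfTransformer mentioned in the question, the pieces can be chained with sklearn's Pipeline. The step names and the `new_sentences` list are invented for illustration, and four training sentences are far too few for meaningful predictions; this only shows the shapes fitting together.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Sentence-level training data: one string and one label per sentence.
X = [
    "It was a sunny day.",
    "The weather was nice and the birds were singing.",
    "I like trains.",
    "Hi, again.",
]
y = [0, 1, 1, 0]

# Raw text -> token counts -> tf-idf weights -> naive Bayes.
clf = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("nb", MultinomialNB()),
])
clf.fit(X, y)

# Hypothetical unseen sentences, already split out of a new document.
new_sentences = ["The birds were singing.", "Hi there."]
predictions = clf.predict(new_sentences)          # one 0/1 label per sentence
probabilities = clf.predict_proba(new_sentences)  # columns follow clf.classes_
```

To build the summary for a new document, split it into sentences, predict on them, and keep the ones labelled 1. If nothing is labelled 1, you could fall back to the sentence with the highest probability for class 1.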
Hope this helps!