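I am trying to classify text in a new project by following this tutorial: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html The goal is to automatically pick the right category in a category tree for a given document.
But I get an error as soon as I call the classification in a loop. Here is most of my classifier class: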
import psycopg2
import psycopg2.extras
from sklearn.datasets import fetch_20newsgroups, load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics
from random import randint
import settings

class Classifier(object):
    # Set Naive Bayes classifier
    nb_classifier = Pipeline([('vect', CountVectorizer()),
                              ('tfidf', TfidfTransformer()),
                              ('clf', MultinomialNB())])
    random = randint(2, 9)

    def __new__(cls):
        inst = object.__new__(cls)
        return inst

    # Constructor
    def __init__(self):
        # Start connection with database
        db_settings = "host='{}' dbname='{}' user='{}' password='{}'".format(
            settings.DB_HOST, settings.DB_TARGET, settings.DB_USER, settings.DB_PASS)
        self.conn = psycopg2.connect(db_settings)
        self.cursor = self.conn.cursor()
        print(randint(2, 9))

    # Get categorized data from db for training purposes
    def getCategories(self, parent):
        if parent == 0:
            self.cursor.execute("""SELECT "categories"."id", concat_ws(', ', products.name::text) AS ab FROM "products"
                INNER JOIN "product_categories" ON "products"."id" = "product_categories"."product_id"
                INNER JOIN "categories" ON "product_categories"."category_id" = "categories"."id"
                WHERE "parent" = 0""")
        else:
            self.cursor.execute("""SELECT "categories"."id", concat_ws(', ', products.name::text) AS ab FROM "products"
                INNER JOIN "product_categories" ON "products"."id" = "product_categories"."product_id"
                INNER JOIN "categories" ON "product_categories"."category_id" = "categories"."id"
                WHERE "categories"."id" IN (SELECT * FROM (
                    WITH RECURSIVE relevant_taxonomy AS (
                        SELECT id
                        FROM categories
                        WHERE id = %s
                        UNION ALL
                        SELECT categories.id
                        FROM categories
                        INNER JOIN relevant_taxonomy ON relevant_taxonomy.id = categories."parent"
                    )
                    SELECT id FROM relevant_taxonomy
                ) AS subtree WHERE subtree.id != %s);""", (parent, parent,))
        return self.cursor.fetchall()

    # Train a classifier with train-data
    def train_classifier(self, classifier, train_data):
        ## train given classifier with given data
        trained_classifier = classifier.fit(train_data.data, train_data.target)
        return trained_classifier
And this is the classification file, where I use the Classifier class. classify.py:
from traindata import Traindata
from classifier import Classifier
import numpy as np
from pprint import pprint

# Get all documents inside with category
def classify(cat, doc):
    # Create instance of classifier
    print('%r => %s' % (doc, cat))
    classifier = Classifier()
    rows = classifier.getCategories(cat)
    if not rows:
        print('put document \n\n "%(1)s" \n\nin term_taxonomy id %(2)s' % {'1': doc, '2': cat})
        return None
    new_docs = [doc]
    # set target id's
    target_ids = []
    myset = set()
    for item in rows:
        if item[0] not in myset:
            target_ids.append(item[0])
            myset.add(item[0])
    # set train_data object
    train_data = Traindata()
    train_data.target_ids = target_ids
    targets = []
    for row in rows:
        train_data.data.append(row[1])
        index = train_data.target_ids.index(row[0])
        targets.append(index)
        print(index)
    #end setting train_data object#
    train_data.target = np.array(targets)
    #train_data.target_ids
    #train_data.target[:100]
    print(train_data)
    trained_classifier = classifier.train_classifier(classifier.nb_classifier, train_data)
    predicted_cats = classifier.predict(new_docs, trained_classifier)
    # pprint(zip(new_docs, predicted_cats))
    print(train_data.target_ids)
    for doc, category_index in zip(new_docs, predicted_cats):
        if not train_data.target_ids[category_index]:
            print('not found')
        # print('%r => %s' % (doc, train_data.target_ids[category_index]))
        val = classify(train_data.target_ids[category_index], doc)
    return train_data.target_ids[category_index]

for doc in ['Loopschoen']:
    classify(0, doc)
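I call the function once for each new document, as you can see at the bottom (for doc in ['Loopschoen']:), and I start with the categories that have no parent (0), which is the root node. The function returns the category it wants to put the document in. That is only the top level of the category tree, though, so I call the function again with this new value (so that it looks among the children of the chosen category). Finally, when it cannot find any child categories, it returns the final category.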
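The descent described above can be sketched without a database or scikit-learn. Everything in this sketch is hypothetical: the CHILDREN tree, the KEYWORDS table and pick_child (a stand-in for training and predicting at one level) are made up for illustration and are not the code from this question.

```python
# Hypothetical category tree: parent id -> list of child ids (0 is the root).
CHILDREN = {0: [1, 2], 2: [5, 6], 6: [20]}

# Hypothetical keywords per category, standing in for the training data.
KEYWORDS = {1: ['dress'], 2: ['shoe'], 5: ['boot'], 6: ['running'],
            20: ['running', 'shoe']}


def pick_child(children, doc):
    """Stand-in for the trained classifier: pick the child whose keywords
    match the most words in the document."""
    words = doc.lower().split()
    return max(children,
               key=lambda c: sum(w in KEYWORDS.get(c, []) for w in words))


def classify(cat, doc):
    """Walk down the tree from cat until a category without children is hit."""
    children = CHILDREN.get(cat, [])
    if not children:
        return cat  # leaf: this is the final category
    # Return the result of the nested call so the leaf id reaches the caller.
    return classify(pick_child(children, doc), doc)


result = classify(0, 'running shoe')
```

One difference from the code in the question: the sketch returns the value of the nested call, whereas the posted classify() stores it in val and then returns the category chosen at the current level.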
But the second pass fails every time with this error.
Error:
ValueError: Found array with dim 46197. Expected 92394
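The loop is the only problem. On the first pass I get a category ID back, number 2. If I then run the script again with classify(2,doc), I get the next category, and after 4 or 5 runs I get the message put document "Loopschoen" in term_taxonomy id 20. So if I run the script over and over and change the value by hand, it works. But the loop fails...
Does anyone know why the loop fails?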
Edit 1:
We know that it fails in the Classifier class, on this line:
trained_classifier = classifier.fit(train_data.data, train_data.target)
But we can't figure out why.
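The ValueError is consistent with fit() receiving a different number of documents than labels, so one way to narrow it down is to compare the two lengths just before that line. A minimal sketch; check_lengths is a hypothetical helper, not part of scikit-learn:

```python
def check_lengths(data, target):
    """Raise a descriptive error if documents and labels are out of step."""
    n_docs, n_labels = len(data), len(target)
    if n_docs != n_labels:
        raise ValueError(
            'data has %d items but target has %d' % (n_docs, n_labels))
    return n_docs
```

Calling check_lengths(train_data.data, train_data.target) on every pass would show at which level of the tree the two start to diverge.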
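Answer 0 (score: 0)
Found the problem: I had to reset the array inside the loop. The code just kept appending values to train_data.data, so its length no longer matched train_data.target: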
train_data = Traindata()
train_data.data = []
train_data.target was expected to have a length of 80841, because train_data.data contained 80841 items (including the items from the previous pass).
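The Traindata class is not shown in the question, so the following is only a guess at why the reset is needed: if data is declared as a list at class level, every Traindata() instance appends to the same list. A minimal sketch, in which both classes and the fill helper are hypothetical stand-ins:

```python
class SharedTraindata(object):
    # Hypothetical: a mutable list at class level is shared by all instances.
    data = []


class FreshTraindata(object):
    def __init__(self):
        # Each instance gets its own list, so nothing leaks between passes.
        self.data = []


def fill(cls, rows):
    """Create a new instance of cls and append rows to its data list."""
    train_data = cls()
    for row in rows:
        train_data.data.append(row)
    return train_data


first = fill(SharedTraindata, ['a', 'b', 'c'])
second = fill(SharedTraindata, ['d', 'e'])
shared_len = len(second.data)       # also counts the rows of the first pass

first_fresh = fill(FreshTraindata, ['a', 'b', 'c'])
second_fresh = fill(FreshTraindata, ['d', 'e'])
fresh_len = len(second_fresh.data)  # only the rows of this pass
```

If that guess is right, assigning train_data.data = [] straight after construction has the same effect as creating the list in __init__.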