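My problem is that it outputs "UserWarning: Label not 0 is present in all training examples". I don't understand what this means, and this is my first time writing machine learning code. Please help me graduate and learn, thank you.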
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# data, h, preprocess_text and stop_words are defined elsewhere (not shown)
dataFrame = []  # list of one-row DataFrames, one per question
categories = ['python', 'if-statement', 'for-loop', 'java']
for i in range(len(data["items"])):
    # Convert HTML to text, since data["items"][i]["body"] returns something like "<p>I have 2 columns <p>"
    html_to_text = h.handle(data["items"][i]["body"])
    html_to_text = html_to_text.lower()
    # converts "what's" to "what is", removes \t, and so on...
    clean_text = preprocess_text(html_to_text)
    data_dict = {'question_body': clean_text, 'python': [0], 'if-statement': [0], 'for-loop': [0], 'java': [0]}
    # change the label to 1 if it is a tag of the question
    for j in range(len(data["items"][i]["tags"])):
        if data["items"][i]["tags"][j] in categories:
            current_key_index = data["items"][i]["tags"][j]
            data_dict[current_key_index] = 1
    # convert to a data frame using Pandas
    from_data_dict = pd.DataFrame.from_dict(data_dict)
    dataFrame.append(from_data_dict)

# train and test data split from scikit-learn
train, test = train_test_split(dataFrame, test_size=0.33, shuffle=True)
# print(train)
X_train = []
X_test = []
for i in range(len(train)):
    X_train.append(train[i].question_body)
# print(X_train[0])
for j in range(len(test)):
    X_test.append(test[j].question_body)

SVC_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
for category in categories:
    print('... Processing {}'.format(category))
    # train the model using X_dtm & y
    for i in range(len(X_train)):
        SVC_pipeline.fit(X_train[i], train[i][category])
I focused on creating the model first, before making any predictions, so the code ends here.
Answer 0 (score: 2)
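The warning says that all samples in the training set have the label 0 after going through train_test_split. As a debugging step, I suggest printing train[i][category] to make sure at least one 1 appears in the training set.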
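One way to run that check for every category at once is to count the label values per column. This is only a minimal sketch: it assumes the rows are plain dicts shaped like data_dict, and the sample rows and the label_counts helper are made up for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the labelled questions; each row mimics data_dict
rows = [
    {"question_body": "how do i loop over a list", "python": 1, "if-statement": 0, "for-loop": 1, "java": 0},
    {"question_body": "null pointer in my class", "python": 0, "if-statement": 0, "for-loop": 0, "java": 1},
    {"question_body": "nested if else", "python": 1, "if-statement": 1, "for-loop": 0, "java": 0},
]
categories = ["python", "if-statement", "for-loop", "java"]


def label_counts(rows, categories):
    """Return {category: Counter of label values} for a list of row dicts."""
    return {c: Counter(row[c] for row in rows) for c in categories}


counts = label_counts(rows, categories)
for category, counter in counts.items():
    # A category whose counter has no key 1 has no positive example at all
    print(category, dict(counter))
```

If a category never shows the value 1 here, the classifier has nothing to learn for it, whatever the split looks like.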
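General advice: use the stratify argument of train_test_split. Note that it expects the array of labels to stratify on (for example stratify=y), not True. This keeps the class proportions roughly the same in the training and test sets, so both splits contain some samples of each class.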
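A short sketch of a stratified split on a single label column. The toy texts and labels are invented for illustration, and random_state is set only to make the run repeatable. With several tag columns, as in the question, you would have to pick one column (or some combined label) to stratify on, which is a choice this sketch does not make for you.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 12 questions, 4 of them tagged with the category
texts = ["question {}".format(i) for i in range(12)]
labels = [1, 0, 0] * 4

# stratify takes the labels themselves, so both splits keep a similar 0/1 ratio
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, shuffle=True, stratify=labels, random_state=0
)

print("train labels:", sorted(y_train))
print("test labels:", sorted(y_test))
```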
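If the labels contain only zeros, double-check that current_key_index really is a key of the dictionary. If none of your labels ever switch to one, the cause may be this line: data_dict[current_key_index] = 1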
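To rule that line out, the row can be built by looping over the known categories instead of over the tags, so that only existing keys are ever written. A minimal sketch, where make_row and the sample tags are hypothetical and not part of the original code:

```python
categories = ["python", "if-statement", "for-loop", "java"]


def make_row(clean_text, tags, categories):
    """Build one row: the question text plus a 0/1 flag per category."""
    row = {"question_body": clean_text}
    for category in categories:
        # Only keys from categories are written, so a stray tag cannot add a column
        row[category] = 1 if category in tags else 0
    return row


# Hypothetical question; 'list' is a tag that is not one of the categories
row = make_row("how do i write a for loop", ["python", "for-loop", "list"], categories)
print(row)
```

Every value in the row is a plain integer, so a list of such rows can be passed to pd.DataFrame in one go rather than building one DataFrame per question.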
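Finally, keep each sample and its label together in the same data structure (for example a tuple (sample, label)) rather than indexing two separate structures as in SVC_pipeline.fit(X_train[i], train[i][category]). This minimizes errors caused by mismatched indices.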
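As a rough illustration of that idea, the sketch below keeps (text, label) pairs for one category and calls fit once on all of them, instead of once per sample. The pairs are invented, and a plain LinearSVC is used here as a simplification for a single binary label; it is not meant as the one correct setup for the question's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical (text, label) pairs for one category, e.g. 'python'
pairs = [
    ("how to use list comprehension in python", 1),
    ("python dict keyerror when reading json", 1),
    ("java nullpointerexception in main method", 0),
    ("java arraylist remove while iterating", 0),
    ("python pandas dataframe filter rows", 1),
    ("java interface vs abstract class", 0),
]

# Unzip the pairs only at the last moment, so texts and labels cannot drift apart
texts, labels = zip(*pairs)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# One fit call over the whole training set: the classifier sees both 0 and 1
pipeline.fit(list(texts), list(labels))
predictions = pipeline.predict(["python list sort", "java string builder"])
print(predictions)
```

Because fit receives every training sample at once, the classifier sees both label values, which a single-sample fit can never provide.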