I am trying to use LightGBM for multiclass text classification. The pandas data frame has 2 columns, 'category' and 'contents', set up as follows.
Data frame:
  contents             category
1 this is example1...  A
2 this is example2...  B
3 this is example3...  C
*The actual data frame consists of approximately 600 rows and 2 columns.
I am trying to classify the text into 3 categories with the code below.
Code:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
stopwords1 = set(stopwords.words('english'))
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
import lightgbm as lgbm
from lightgbm import LGBMClassifier, LGBMRegressor
#--main code--#
X_train, X_test, Y_train, Y_test = train_test_split(df['contents'], df['category'], random_state = 0, test_size=0.3, shuffle=True)
count_vect = CountVectorizer(ngram_range=(1,2), stop_words=stopwords1)
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer(use_idf=True, smooth_idf=True, norm='l2', sublinear_tf=True)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
lgbm_train = lgbm.Dataset(X_train_tfidf, Y_train)
lgbm_eval = lgbm.Dataset(count_vect.transform(X_test), Y_test, reference=lgbm_train)
params = {
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'learning_rate': 0.02,
    'num_class': 3,
    'early_stopping': 100,
    'num_iteration': 2000,
    'num_leaves': 31,
    'is_enable_sparse': 'true',
    'tree_learner': 'data',
    'max_depth': 4,
    'n_estimators': 50
}
clf_gbm = lgbm.train(params, valid_sets=lgbm_eval)
predicted_LGBM = clf_gbm.predict(count_vect.transform(X_test))
print(accuracy_score(Y_test, predicted_LGBM))
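Running this code raises the error:
ValueError: could not convert string to float: 'b'
I also converted the 'category' column from ['a', 'b', 'c'] to ints [0, 1, 2], but that produced an error as well.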
What is wrong with my code?
Any advice or suggestions would be greatly appreciated.
Thanks in advance.
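Answer 0 (score: 2)
I managed to solve this issue. It is very simple, but I am noting it here for reference.
Since LightGBM expects float32/64 input, 'category' should be numeric rather than str, and the input data should be converted to float32/64 with .astype().
Changes 1: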
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf = X_train_tfidf.astype('float32')
X_test_counts = X_test_counts.astype('float32')
Y_train = Y_train.astype('float32')
Y_test = Y_test.astype('float32')
Changes 2:
Simply convert the 'category' column from [A, B, C, ...] to [0.0, 1.0, 2.0, ...].
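One possible way to do this conversion with pandas is sketched below. The three-row data frame mirrors the example in the question, and the 'label' column name and the code_to_name dictionary are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Small stand-in for the real data frame (assumption: same two columns as in the question).
df = pd.DataFrame({
    "contents": ["this is example1...", "this is example2...", "this is example3..."],
    "category": ["A", "B", "C"],
})

# Treat the labels as a pandas categorical; categories are sorted, so A -> 0, B -> 1, C -> 2.
cat = df["category"].astype("category")
df["label"] = cat.cat.codes.astype("float32")

# Keep the mapping so predicted codes can be turned back into the original names.
code_to_name = dict(enumerate(cat.cat.categories))
print(df["label"].tolist(), code_to_name)
```

Keeping the code-to-name mapping around makes it easy to report predictions with the original category names afterwards.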
Specifying the dtype directly, as in TfidfVectorizer(dtype=np.float32), may well be enough in this case.
It is also much simpler to pass the vectorized data to LGBMClassifier.
Update
Using TfidfVectorizer is much simpler:
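# dtype=np.float32 yields float32 features, so no separate .astype() call is needed
tfidf_vec = TfidfVectorizer(dtype=np.float32, sublinear_tf=True, use_idf=True, smooth_idf=True)
# The vocabulary is fitted on the whole 'contents' column, then each split is transformed
X_data_tfidf = tfidf_vec.fit_transform(df['contents'])
X_train_tfidf = tfidf_vec.transform(X_train)
X_test_tfidf = tfidf_vec.transform(X_test)
clf_LGBM = lgbm.LGBMClassifier(objective='multiclass', verbose=-1, learning_rate=0.5, max_depth=20, num_leaves=50, n_estimators=120, max_bin=2000)
# verbose is already set on the constructor; recent LightGBM releases no longer accept it in fit()
clf_LGBM.fit(X_train_tfidf, Y_train)
predicted_LGBM = clf_LGBM.predict(X_test_tfidf)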