I have a movie data table in which several columns hold text/categorical variables. I embed those columns with SentenceTransformer:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')
df['movieName_embed'] = df.apply(lambda row : model.encode(row['movieName']), axis = 1)
df['usertags_embed'] = df.apply(lambda row : model.encode(row['usertags']), axis = 1)
After this embedding step and a few other encoding techniques, the dataframe looks like this.
Then I create the features and the target as follows:
from sklearn.model_selection import train_test_split
X = df[['movieName_embed', 'usertags_embed', 'rating']]
y = df[['genre_fe']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
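The movieName_embed and usertags_embed columns are of type list of list of numbers (every cell holds a whole embedding vector), which is not suitable for training in xgboost. So when I run xgboost.XGBRegressor().fit(X_train, y_train), I get this error: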
ValueError: DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in fields movieName_embed, usertags_embed
So how can I convert the embeddings so that they are suitable for training?
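One possible direction, offered as a minimal sketch rather than a definitive answer: expand each embedding column into one float column per dimension, so that every column in X has a plain numeric dtype. The helper name expand_embedding_column is hypothetical, and the 3-dimensional toy vectors below only stand in for the real model.encode output. The sketch assumes every vector within a column has the same length.

```python
import numpy as np
import pandas as pd


def expand_embedding_column(df, column):
    """Hypothetical helper: one float column per embedding dimension."""
    # Stack the per-row vectors into a 2-D array of shape (n_rows, embedding_dim).
    matrix = np.vstack(df[column].tolist())
    names = [f"{column}_{i}" for i in range(matrix.shape[1])]
    # Reuse the original index so the result lines up with the other columns.
    return pd.DataFrame(matrix, columns=names, index=df.index)


# Toy stand-in data (assumption): short vectors instead of real embeddings.
df = pd.DataFrame({
    "movieName_embed": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
    "usertags_embed": [np.array([1.0, 0.0, 0.5]), np.array([0.0, 1.0, 0.5])],
    "rating": [4.5, 3.0],
})

# Build a purely numeric feature frame: expanded embeddings plus the rating.
X = pd.concat(
    [
        expand_embedding_column(df, "movieName_embed"),
        expand_embedding_column(df, "usertags_embed"),
        df[["rating"]],
    ],
    axis=1,
)
```

With a frame like this, X contains only float columns, so it can go through train_test_split and into the fit call in the same way as before. Whether treating each embedding dimension as a separate feature works well for this data is something to check empirically.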