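I am trying to build a model that combines the numeric features and the text features of a data frame. However, I am having a lot of trouble combining the features, training on them, and then testing on them.
Right now I am trying to use DataFrameMapper like this: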
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    ('body', TfidfVectorizer()),
    ('numeric_feature', None),
])

for train_index, test_index in kFold.split(DF['body']):
    # Split the dataset by Kfold
    X_train = even_rand[['body','numeric_feature']].iloc[train_index]
    y_train = even_rand['sub_class'].iloc[train_index]
    X_test = even_rand[['body','numeric_feature']].iloc[test_index]
    y_test = even_rand['sub_class'].iloc[test_index]

    # Vectorize/transform docs
    X_train = mapper.fit_transform(X_train)
    X_test = mapper.fit_transform(X_test)

    # Get SVM
    svm = SGDClassifier(loss='hinge', penalty='l2',
                        alpha=1e-3, n_iter=5, random_state=10)
    svm.fit(X_train, y_train)
    svm_score = svm.score(X_test, y_test)
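This successfully combines the data and trains on it, but when I try to test, the features do not seem to line up, and I get the error:
ValueError: X has 49974 features per sample; expecting 87786
Does anyone know how to fix this, or know a better way to combine, train on, and test numeric and text features together? If possible, I would also like to keep the features as a sparse matrix.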
Answer 0: (score: 2)
Instead of:
X_train = mapper.fit_transform(X_train)
X_test = mapper.fit_transform(X_test)
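Try:
X_train = mapper.fit_transform(X_train)
X_test = mapper.transform(X_test)
Calling fit_transform on the test set refits the TfidfVectorizer on the test documents, so the test matrix gets a different vocabulary, and therefore a different number of columns, than the one the classifier was trained on.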
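To see why the feature counts in the ValueError differ, here is a minimal pure-Python sketch. ToyCountVectorizer is a hypothetical stand-in written for this illustration, not sklearn's TfidfVectorizer; it only assumes that a text vectorizer learns its vocabulary in fit and emits one column per learned token in transform.

```python
# Toy stand-in for a text vectorizer (NOT sklearn's TfidfVectorizer):
# the vocabulary learned in fit() decides how many columns transform() emits.
class ToyCountVectorizer:
    def fit(self, docs):
        # Learn one column per distinct lowercase token seen in docs.
        vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
        self.vocabulary_ = {tok: i for i, tok in enumerate(vocab)}
        return self

    def transform(self, docs):
        # Count tokens using the learned vocabulary; unseen tokens are dropped.
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for tok in doc.lower().split():
                col = self.vocabulary_.get(tok)
                if col is not None:
                    row[col] += 1
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)


train_docs = ["the cat sat", "the dog barked loudly"]
test_docs = ["a cat barked"]

vec = ToyCountVectorizer()
X_train = vec.fit_transform(train_docs)      # learns 6 tokens from the training docs
X_test_ok = vec.transform(test_docs)         # reuses the training vocabulary

refit = ToyCountVectorizer()
X_test_bad = refit.fit_transform(test_docs)  # learns a separate 3-token vocabulary

print(len(X_train[0]), len(X_test_ok[0]), len(X_test_bad[0]))  # prints: 6 6 3
```

Only the matrix produced by transform with the already fitted vectorizer has the same width as the training matrix; the refit one does not, which is the same kind of mismatch the classifier reports.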