scikit-learn logistic regression model with TfidfVectorizer

Asked: 2018-09-01 07:06:51

Tags: python machine-learning scikit-learn logistic-regression tfidfvectorizer

I'm trying to build a logistic regression model with scikit-learn using the code below. I use 9 columns as features (X) and 1 column as the label (Y). When I try to fit I get "ValueError: Found input variables with inconsistent numbers of samples: [9, 560000]" even though X and Y are the same length, and if I apply X.transpose() first I get a different error, "AttributeError: 'int' object has no attribute 'lower'". I assume this has to do with the TfidfVectorizer, which I'm using because 3 of the columns contain single words and nothing works without it. Is this the right way to do this, or should I transform the words in those columns separately and then use train_test_split? If not, why am I getting these errors and how do I fix them? Here is a sample of the csv and the code:

df = pd.read_csv("UNSW-NB15_1.csv",header=None, names=cols, encoding = "UTF-8",low_memory=False) 

df.to_csv('netraf.csv')
csv = 'netraf.csv'
my_df = pd.read_csv(csv)

x_features = my_df.columns[1:10]
x_data = my_df[x_features]
Y = my_df["Label"]

x_train, x_validation, y_train, y_validation = model_selection.train_test_split(
    x_data, Y, test_size=0.2, random_state=7)

tfidf_vectorizer = TfidfVectorizer()
lr = LogisticRegression()
tfidf_lr_pipe = Pipeline([('tfidf', tfidf_vectorizer), ('lr', lr)])

tfidf_lr_pipe.fit(x_train, y_train)  

1 Answer:

Answer 0 (score: 0)

What you are trying to do is unusual, because TfidfVectorizer is designed to extract numerical features from text. However, if you don't care much about that and just want the code to work, one approach is to convert your numeric data to strings and configure TfidfVectorizer to accept pre-tokenized data:

import pandas as pd
from sklearn import model_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

cols = ['srcip','sport','dstip','dsport','proto','service','smeansz','dmeansz','attack_cat','Label']
df = pd.read_csv("UNSW-NB15_1.csv",header=None, names=cols, encoding = "UTF-8",low_memory=False) 

df.to_csv('netraf.csv')
csv = 'netraf.csv'
my_df = pd.read_csv(csv)

# replace NaN with empty strings first, like we don't care
# (this must happen before astype(str), or NaN would become the literal string 'nan')
for col in my_df.columns[my_df.isna().any()].tolist():
    my_df[col] = my_df[col].fillna('')

# convert all columns to string like we don't care
for col in my_df.columns:
    my_df[col] = my_df[col].astype(str)

x_features = my_df.columns[1:10]
x_data = my_df[x_features]
Y = my_df["Label"]

x_train, x_validation, y_train, y_validation = model_selection.train_test_split(
    x_data.values, Y.values, test_size=0.2, random_state=7)

# configure TfidfVectorizer to accept tokenized data
# reference http://www.davidsbatista.net/blog/2018/02/28/TfidfVectorizer/
tfidf_vectorizer = TfidfVectorizer(
    analyzer='word',
    tokenizer=lambda x: x,
    preprocessor=lambda x: x,
    token_pattern=None)

lr = LogisticRegression()
tfidf_lr_pipe = Pipeline([('tfidf', tfidf_vectorizer), ('lr', lr)])
tfidf_lr_pipe.fit(x_train, y_train)
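
Once the pipeline is fitted you can sanity-check it on the held-out split. A minimal sketch (not part of the original answer), reusing the variables defined above:

from sklearn.metrics import accuracy_score, classification_report

# evaluate the fitted pipeline on the validation data held out earlier
y_pred = tfidf_lr_pipe.predict(x_validation)
print(accuracy_score(y_validation, y_pred))
print(classification_report(y_validation, y_pred))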

That said, I would recommend a different approach to feature-engineering this dataset. For example, you could try to encode your nominal data (e.g. IPs, ports) as numerical values.
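
A rough sketch of that alternative (not part of the original answer), assuming the column names from cols above; which columns to treat as nominal versus numeric is a guess made for illustration:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

nominal_cols = ['proto', 'service', 'srcip', 'dstip']   # assumed nominal
numeric_cols = ['smeansz', 'dmeansz']                    # assumed already numeric

raw = pd.read_csv('netraf.csv')            # re-read the data before the astype(str) step
X = raw[nominal_cols + numeric_cols]
y = raw['Label']

# one-hot encode the nominal columns, pass the numeric ones through unchanged
preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), nominal_cols)],
    remainder='passthrough')

num_pipe = Pipeline([('prep', preprocess), ('lr', LogisticRegression(max_iter=1000))])
num_pipe.fit(X, y)

Note that high-cardinality columns such as srcip/dstip can make the one-hot encoding very wide, so grouping, hashing, or dropping them may work better in practice.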