Using TfidfVectorizer() but getting the error 'ValueError: X has 3 features per sample; expecting 3231926'

Time: 2019-04-19 04:15:40

Tags: python machine-learning scikit-learn feature-extraction

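My project is to build a classifier that decides whether a URL is safe to visit. The model is trained on this dataset, which is in SVMLight format:

http://www.sysnet.ucsd.edu/projects/url/#datasets (How can these datasets be read as text?)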
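
On the parenthetical question: SVMLight files are already plain text, one sample per line in the form `label index:value index:value ...`, so they can be read line by line without scikit-learn. A minimal sketch, assuming a small in-memory archive stands in for `url_svmlight.tar.gz`; the member name `demo/Day0.svm` and its two sample lines are made up for illustration:

```python
import io
import tarfile

# Hypothetical stand-in for url_svmlight.tar.gz: one member, two samples.
sample = b"+1 4:0.5 7:1 12:0.25\n-1 2:1 7:0.75\n"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="demo/Day0.svm")
    info.size = len(sample)
    tar.addfile(info, io.BytesIO(sample))
buf.seek(0)


def parse_svmlight_line(line):
    """Split 'label idx:val idx:val ...' into (label, {idx: val})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return int(label), features


rows = []
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    for member in tar:
        if member.isfile():
            with tar.extractfile(member) as f:
                for raw in f:
                    rows.append(parse_svmlight_line(raw.decode("ascii")))

for label, features in rows:
    print(label, features)
```

With the real archive, the same loop would apply after opening the file by path instead of from `buf`.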

I tried using TfidfVectorizer() to convert the URL input into a feature vector, but I get this error at prediction time:

"ValueError: X has 3 features per sample; expecting 3231926"
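
The mismatch can be reproduced in isolation. `max_features` only caps the vocabulary size; it does not pad the output, so a `TfidfVectorizer` fitted on a single string produces one column per token found in that string. A minimal sketch, assuming an arbitrary cap of 1000 and a placeholder URL:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

cap = 1000  # arbitrary upper bound, standing in for n_features
url = ["http://www.example.com"]  # placeholder input

tvect = TfidfVectorizer(max_features=cap)
vector = tvect.fit_transform(url)

# The vocabulary holds only the tokens seen in this one string.
print(sorted(tvect.vocabulary_))
# One column per vocabulary entry, far fewer than `cap`.
print(vector.shape)
```

A classifier trained on a matrix with 3231926 columns rejects such a narrow vector, which is what the `ValueError` reports.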

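I also tried

tvect = HashingVectorizer(n_features=n_features-26)

together with SGDClassifier. That can predict on the input, but it always classifies it as class "1" (URL is safe to visit).

I tried malicious-looking URLs such as "http://www.batmanporn.com" and "http://www.virusparty.com", but they were still classified as class "1".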
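
`HashingVectorizer` avoids the shape error because its output width is fixed by `n_features`, whatever the input is. That alone does not make the columns meaningful, though: a hashed token lands in an arbitrary column, which presumably has no relation to what that column meant in the SVMLight training data, and that may be why every URL ends up in the same class. A small sketch, assuming a width of 2**10 purely for illustration:

```python
from sklearn.feature_extraction.text import HashingVectorizer

width = 2 ** 10  # illustrative width, not the dataset's real feature count
hv = HashingVectorizer(n_features=width)

# HashingVectorizer is stateless, so transform() works without fit().
a = hv.transform(["http://www.example.com"])
b = hv.transform(["http://another.example.org/some/path"])

print(a.shape, b.shape)  # both have `width` columns
print(a.nnz, b.nnz)      # only a few columns are non-zero
```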

import tarfile
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.datasets import load_svmlight_file
#from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

uri = 'C:\\Users\\faceb\\svm_light\\url_svmlight.tar.gz'
tar = tarfile.open(uri,"r:gz")
max_rows = 0    # largest number of samples seen in one file
n_features = 0  # largest number of feature columns seen in one file
i = 0
split = 5
for tarinfo in tar:
    print("extracting %s, f size %s" % (tarinfo.name, tarinfo.size))
    if tarinfo.isfile():
        f = tar.extractfile(tarinfo.name)
        X, y = load_svmlight_file(f)
        # X.shape is (n_samples, n_features)
        max_rows = np.maximum(max_rows, X.shape[0])
        n_features = np.maximum(n_features, X.shape[1])
    if i > split:
        break
    i += 1
print("max samples per file = %s, max feature dimension = %s" % (max_rows, n_features))
classes = [-1, 1]
# scikit-learn 1.1 and later name this loss "log_loss"
sgd = SGDClassifier(loss="log")
i = 0
for tarinfo in tar:
    if i > split:
        break
    if tarinfo.isfile():
        f = tar.extractfile(tarinfo.name)
        # Unpack into X, y so training uses this file, not the last one from the first loop
        X, y = load_svmlight_file(f, n_features=n_features)
        print("%s,%s" % (X, y))
        if i < split:
            sgd.partial_fit(X, y, classes=classes)
        if i == split:
            # classification_report expects (y_true, y_pred)
            print(classification_report(y, sgd.predict(X)))
    i += 1

url = ["nothing here"]
url[0] = input('Enter url : ')
#vectorizer = HashingVectorizer(n_features=n_features-26)
tvect = TfidfVectorizer(max_features=n_features)
tvect.fit(url)
vector = tvect.transform(url)
new_predict = sgd.predict(vector)
if new_predict[0] == 1:  # always true
    print("%s is safe to visit" % (url[0]))
elif new_predict[0] == -1:
    print("%s is not safe to visit" % (url[0]))

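I need to convert the input into a feature vector with the number of features the model expects, so that my classifier can make a prediction on it.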
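
One way to get a consistent feature space is to train and predict through the same vectorizer on raw URL strings, rather than combining pre-extracted SVMLight features with a vectorizer applied only at prediction time. A minimal sketch of that idea, assuming raw URLs with labels are available; the four URLs and their labels below are invented placeholders, and the result says nothing about real-world accuracy:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical training data: raw URLs with labels (1 = safe, -1 = unsafe).
train_urls = [
    "http://www.example.com/index.html",
    "http://docs.example.org/guide",
    "http://free-prizes.example.net/win.exe",
    "http://login-verify.example.net/account",
]
train_labels = [1, 1, -1, -1]

# Character n-grams, since URLs contain few word-like tokens.
vectorizer = HashingVectorizer(analyzer="char_wb", ngram_range=(2, 4),
                               n_features=2 ** 16)
clf = SGDClassifier(random_state=0)

X_train = vectorizer.transform(train_urls)
clf.partial_fit(X_train, train_labels, classes=[-1, 1])

# The same vectorizer yields the width the classifier expects.
X_new = vectorizer.transform(["http://www.example.com/about"])
prediction = clf.predict(X_new)
print(X_new.shape[1] == X_train.shape[1], prediction)
```

Because the hashing vectorizer is stateless, the identical object (or one built with the same parameters) can be reused at prediction time without storing a vocabulary.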

0 answers:

No answers yet