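My project is building a classifier that decides which URLs are safe to visit. The model is trained on the SVMLight-format dataset from
http://www.sysnet.ucsd.edu/projects/url/#datasets (how can these datasets be read as text?)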
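For context, SVMLight files are plain text: each line holds a label followed by sparse `index:value` pairs. Below is a minimal sketch of reading such lines without scikit-learn. The `parse_svmlight_line` helper and the two sample rows are made up for illustration and are not taken from the dataset; optional `qid:` fields are not handled.

```python
import io

def parse_svmlight_line(line):
    """Parse one SVMLight line into (label, {feature_index: value})."""
    # Drop a trailing "# comment", if any, then split on whitespace.
    body = line.split("#", 1)[0].split()
    label = int(float(body[0]))
    features = {}
    for pair in body[1:]:
        index, value = pair.split(":", 1)
        features[int(index)] = float(value)
    return label, features

# Hypothetical sample in SVMLight format (not real rows from the dataset).
sample = io.StringIO("+1 4:0.5 11:1 27:0.25\n-1 2:1 11:0.75\n")
rows = [parse_svmlight_line(line) for line in sample if line.strip()]
for label, features in rows:
    print(label, features)
```

Note that the format stores numeric feature indices and values; it does not by itself carry the original URL text.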
I tried using TfidfVectorizer() to convert the input URL into a feature vector, but prediction fails with this error:
"ValueError: X has 3 features per sample; expecting 3231926"
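I am now trying
tvect = HashingVectorizer(n_features=n_features-26)
with SGDClassifier. It can predict on the input, but it always returns class "1" (safe to visit).
I tried entering malicious-looking URLs such as "http://www.batmanporn.com" and "http://www.virusparty.com", and they are still classified as class "1".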
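The ValueError seems to come from fitting the vectorizer on the single input URL, so its vocabulary contains only that URL's tokens and `max_features` acts merely as an upper bound. A small sketch of the effect, assuming scikit-learn is installed and using a made-up URL:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

url = ["http://www.example.com/login"]  # hypothetical input

tvect = TfidfVectorizer(max_features=3231926)
vector = tvect.fit_transform(url)

# The vocabulary is learned from this one string only, so the number of
# columns equals the number of distinct tokens, far below max_features.
tokens = sorted(tvect.vocabulary_)
print(tokens)
print(vector.shape)
```

The resulting matrix has one column per distinct token in the URL, which is why its width does not match the 3231926 columns the classifier was trained on.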
import tarfile

import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report

uri = 'C:\\Users\\faceb\\svm_light\\url_svmlight.tar.gz'
tar = tarfile.open(uri, "r:gz")
max_obs = 0   # largest number of samples (rows) seen in any file
max_vars = 0  # largest number of features (columns) seen in any file
i = 0
split = 5

# First pass: find the widest feature dimension across the first files.
for tarinfo in tar:
    print("extracting %s, f size %s" % (tarinfo.name, tarinfo.size))
    if tarinfo.isfile():
        f = tar.extractfile(tarinfo.name)
        X, y = load_svmlight_file(f)
        max_obs = np.maximum(max_obs, X.shape[0])
        max_vars = np.maximum(max_vars, X.shape[1])
    if i > split:
        break
    i += 1
print("max samples = %s, max features = %s" % (max_obs, max_vars))

classes = [-1, 1]
sgd = SGDClassifier(loss="log")  # spelled "log_loss" in scikit-learn 1.1 and later
n_features = max_vars
i = 0

# Second pass: train incrementally on the first files, evaluate on the next one.
for tarinfo in tar:
    if i > split:
        break
    if tarinfo.isfile():
        f = tar.extractfile(tarinfo.name)
        # Reload with a fixed width so every file has the same number of columns.
        X, y = load_svmlight_file(f, n_features=n_features)
        print("%s,%s" % (X, y))
        if i < split:
            sgd.partial_fit(X, y, classes=classes)
        if i == split:
            # classification_report expects (y_true, y_pred).
            print(classification_report(y, sgd.predict(X)))
        i += 1

url = ["nothing here"]
url[0] = input('Enter url : ')
# vectorizer = HashingVectorizer(n_features=n_features-26)
tvect = TfidfVectorizer(max_features=n_features)
tvect.fit(url)
vector = tvect.transform(url)
new_predict = sgd.predict(vector)
if new_predict[0] == 1:  # always true
    print("%s is safe to visit" % (url[0]))
elif new_predict[0] == -1:
    print("%s is not safe to visit" % (url[0]))

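I need to convert the input URL into a feature vector with the number of features the classifier expects, so that the classifier can actually make a prediction on it.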
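One way to keep the widths consistent is to apply the same stateless vectorizer at training and prediction time. The sketch below assumes labeled raw URL strings are available, which the SVMLight files do not provide; the training URLs, labels, and the choice of character n-grams are all made up for illustration. Hashed text features like these would not line up with the dataset's precomputed feature indices, so this is a separate model and not a fix for the one trained above.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Made-up training URLs and labels (1 = safe, -1 = unsafe), for illustration only.
train_urls = [
    "http://www.example.com/home",
    "http://docs.example.org/guide",
    "http://free-prizes.example.net/win.exe",
    "http://login-verify.example.biz/update",
]
train_labels = [1, 1, -1, -1]

# HashingVectorizer is stateless: the output width is fixed by n_features,
# so training and prediction inputs always have the same number of columns.
vectorizer = HashingVectorizer(n_features=2**18, analyzer="char", ngram_range=(3, 5))
X_train = vectorizer.transform(train_urls)

sgd = SGDClassifier(random_state=0)
sgd.partial_fit(X_train, train_labels, classes=[-1, 1])

new_url = ["http://www.example.net/index.html"]  # hypothetical input
X_new = vectorizer.transform(new_url)
print(X_train.shape, X_new.shape)
print(sgd.predict(X_new))
```

With only four toy samples the prediction itself is meaningless; the point is that both matrices share the same number of columns, so `predict` no longer raises the feature-count error.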