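I have just started working on a classification project to detect phishing websites. I am using the UCI dataset https://archive.ics.uci.edu/ml/machine-learning-databases/00327/Training%20Dataset.arff. I am trying several models, such as ANN, SVM, and logistic regression, and I have already trained and tested them.
My logistic regression code is shown below: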
#importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#importing the dataset
dataset = pd.read_csv("phishcoop.csv")
x = dataset.iloc[: , :-1].values
y = dataset.iloc[:, -1]
# Split the dataset into training and test sets
# (sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25,
                                                    train_size=0.75, random_state=0)
#fitting logistic regression into training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state =0)
classifier.fit(x_train, y_train)
#Predicting values for test data
y_pred = classifier.predict(x_test)
# Checking accuracy using the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
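The snippet above builds the confusion matrix but stops short of turning it into an accuracy figure. A minimal sketch of that last step follows, assuming `cm` is laid out the way scikit-learn's `confusion_matrix` returns it (rows are true labels, columns are predicted labels). The helper name `accuracy_from_confusion_matrix` and the counts are hypothetical and are not results from the phishing dataset.

```python
def accuracy_from_confusion_matrix(cm):
    """Return the fraction of correct predictions.

    Correct predictions sit on the diagonal, so accuracy is the
    diagonal sum divided by the sum of all cells.
    """
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Hypothetical counts for illustration only.
example_cm = [[50, 5],
              [10, 35]]
print(accuracy_from_confusion_matrix(example_cm))
```

The helper accepts nested lists as well as the 2-D array that `confusion_matrix` returns. If only the accuracy is needed, `sklearn.metrics.accuracy_score(y_test, y_pred)` gives the same number directly.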
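Now that I have trained and tested the model, I have some questions.
I am new to machine learning and this is my first time working with URLs, so please correct me if I am wrong.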
Answer 0 (score: 1)
It sounds like you just want to parse a URL and then look up the IP address of the hostname it contains.
For Python 3 (for Python 2, see how to import it here: https://docs.python.org/2/library/urlparse.html):
from urllib.parse import urlparse, parse_qs
import socket
url = 'http://example.com/x/y?a=1&b=2&a=3'
# Parse the URL
parsed = urlparse(url)
# For the parameters
params = parse_qs(parsed.query)
print(params)
# For path components
# Note: Depending on the URL, this may have empty strings so that's why the
# filter is used
path_components = list(filter(bool, parsed.path.split('/')))
print(path_components)
# Location
print(parsed.netloc)
# IP (parsed.hostname is used here because parsed.netloc may also contain a port or credentials)
print(socket.gethostbyname(parsed.hostname))
This outputs (the IP address depends on DNS resolution at the time you run it, so it may differ):
{'a': ['1', '3'], 'b': ['2']}
['x', 'y']
example.com
93.184.216.34
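Since the question is about feeding URLs to a classifier, the parsed pieces can be turned into simple features. The following is only a rough sketch: the function names `host_is_ip` and `url_features`, and the choice of features, are assumptions made for illustration. They are loosely inspired by the kind of attributes a phishing dataset might contain and do not reproduce the UCI dataset's exact definitions or encodings.

```python
import ipaddress
from urllib.parse import urlparse


def host_is_ip(hostname):
    """Return True if the hostname is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(hostname)
        return True
    except ValueError:
        return False


def url_features(url):
    """Derive a few illustrative (hypothetical) features from a URL string."""
    parsed = urlparse(url)
    # hostname is None when the URL has no network location
    host = parsed.hostname or ''
    return {
        'url_length': len(url),
        'has_at_symbol': '@' in url,
        'host_is_ip': host_is_ip(host),
        'hyphen_in_host': '-' in host,
        'num_host_dots': host.count('.'),
        'uses_https': parsed.scheme == 'https',
    }


print(url_features('http://example.com/x/y?a=1&b=2&a=3'))
```

No DNS lookup is needed for these features, so the function also works offline. To score a new URL with a model trained on the UCI data, the features would have to be computed and encoded exactly as that dataset defines them.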