How do I extract features from a URL in Python?

Asked: 2018-04-22 14:46:24

Tags: python url machine-learning classification

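I have just started working on a classification project for detecting phishing websites. I am using the UCI dataset https://archive.ics.uci.edu/ml/machine-learning-databases/00327/Training%20Dataset.arff. I am trying several models, such as ANN, SVM, and logistic regression, and I have already trained and tested them.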

  

My logistic regression code is shown below:

#importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#importing the dataset
dataset = pd.read_csv("phishcoop.csv")
x = dataset.iloc[: , :-1].values
y = dataset.iloc[:, -1]

#Split the dataset into training and test
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

#fitting logistic regression into training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state =0)
classifier.fit(x_train, y_train)

#Predicting values for test data
y_pred = classifier.predict(x_test)

#checking accuracy using confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Now that I have trained and tested the model, I have a few questions:

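  • How do I extract the dataset's 30 features from a URL that the user enters?
  • Is there a Python library for this purpose that would help me extract these features?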

I am new to machine learning and this is my first time working with URLs, so please correct me if I am wrong.

1 answer:

Answer 0 (score: 1)

It sounds like you just want to parse a URL, and then look up the IP address of the hostname it contains.

For Python 3 (for Python 2, see how to import it here: https://docs.python.org/2/library/urlparse.html):

from urllib.parse import urlparse, parse_qs
import socket


url = 'http://example.com/x/y?a=1&b=2&a=3'

# Parse the URL
parsed = urlparse(url)

# For the parameters
params = parse_qs(parsed.query)
print(params)

# For path components
# Note: Depending on the URL, this may have empty strings so that's why the
# filter is used
path_components = list(filter(bool, parsed.path.split('/')))
print(path_components)

# Location
print(parsed.netloc)

# IP (use hostname rather than netloc, which may also carry a port or credentials)
print(socket.gethostbyname(parsed.hostname))

This will output (the IP address comes from DNS and may differ when you run it):

{'a': ['1', '3'], 'b': ['2']}
['x', 'y']
example.com
93.184.216.34
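
Building on the parsing above, here is a minimal sketch of how a few of the dataset's features could be derived from the URL string alone. The helper names (`has_ip_address`, `lexical_features`) are hypothetical. The dictionary keys follow what I believe the dataset's column names to be, and the 1 / 0 / -1 encoding and the length thresholds are assumptions based on my reading of the dataset's feature description, so check them against your CSV header and the dataset documentation before relying on them. Many of the 30 features depend on the page content or on external lookups (WHOIS, DNS, traffic rank) and cannot be computed from the URL text like this.

```python
import ipaddress
from urllib.parse import urlparse

# Assumed encoding: 1 = legitimate, 0 = suspicious, -1 = phishing.
LEGITIMATE, SUSPICIOUS, PHISHING = 1, 0, -1


def has_ip_address(hostname):
    """Return True if the host part is a literal IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(hostname)
        return True
    except ValueError:
        return False


def lexical_features(url):
    """Hypothetical helper: derive a few URL-only features as a dict."""
    parsed = urlparse(url)
    hostname = parsed.hostname or ''

    # Assumed thresholds: < 54 legitimate, 54-75 suspicious, > 75 phishing.
    if len(url) < 54:
        url_length = LEGITIMATE
    elif len(url) <= 75:
        url_length = SUSPICIOUS
    else:
        url_length = PHISHING

    # Everything after the scheme separator, used for the '//' check below.
    after_scheme = url.split('://', 1)[-1]

    return {
        'having_IP_Address': PHISHING if has_ip_address(hostname) else LEGITIMATE,
        'URL_Length': url_length,
        'having_At_Symbol': PHISHING if '@' in url else LEGITIMATE,
        # A second '//' suggests a redirect embedded in the URL.
        'double_slash_redirecting': PHISHING if '//' in after_scheme else LEGITIMATE,
        # A '-' in the host name is treated as a prefix/suffix separator.
        'Prefix_Suffix': PHISHING if '-' in hostname else LEGITIMATE,
    }


print(lexical_features('http://example.com/x/y?a=1&b=2&a=3'))
```

To score a URL with the trained classifier you would still need all 30 values, in the same column order as the training data, so a partial dictionary like this one is only a starting point.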