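I'm developing an application that needs to parse URLs (mostly HTTP URLs) found in HTML pages. I have no control over the input, and some of it is, as expected, a bit messy.
One problem I run into frequently is that urlparse is very strict (possibly even buggy?) when parsing and joining URLs whose path part contains a double slash, for example: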
testUrl = 'http://www.example.com//path?foo=bar'
urlparse.urljoin(testUrl,
                 urlparse.urlparse(testUrl).path)
Instead of the expected result http://www.example.com//path (or even better, the same URL with a normalized single slash), I end up with http://path.
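By the way, the reason I'm running this kind of code is that it's the only way I've found so far to strip the query/fragment part from a URL. There may be a better way, but I couldn't find one.
Can anyone recommend a way to avoid this, or should I just normalize the path myself with a (relatively simple, I know) regex?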
Answer 0 (score: 6)
The path on its own (//path) is not valid; it confuses the function and gets interpreted as a hostname.
http://tools.ietf.org/html/rfc3986.html#section-3.3
If a URI does not contain an authority component, then the path cannot begin with two slash characters ("//").
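To make the hostname interpretation concrete, here is a minimal sketch. It assumes Python 3, where the `urlparse` module used in this thread lives in `urllib.parse`; the variable names are only illustrative.

```python
from urllib.parse import urljoin, urlparse

test_url = "http://www.example.com//path?foo=bar"

# The path component keeps the leading double slash.
path = urlparse(test_url).path
print(path)

# Parsed on its own, "//path" is a network-path reference:
# "path" ends up in netloc and the path component is empty.
standalone = urlparse(path)
print(standalone.netloc, repr(standalone.path))

# urljoin therefore keeps only the scheme of the base URL.
joined = urljoin(test_url, path)
print(joined)
```

So the surprising result comes from feeding a bare `//path` back into urljoin, not from the initial parse.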
I don't particularly like either of these solutions, but they work:
import re
import urlparse
testurl = 'http://www.example.com//path?foo=bar'
parsed = list(urlparse.urlparse(testurl))
parsed[2] = re.sub("/{2,}", "/", parsed[2]) # replace two or more / with one
cleaned = urlparse.urlunparse(parsed)
print cleaned
# http://www.example.com/path?foo=bar
print urlparse.urljoin(
    testurl,
    urlparse.urlparse(cleaned).path)
# http://www.example.com/path
Depending on what you are doing, you can do the join manually:
import re
import urlparse
testurl = 'http://www.example.com//path?foo=bar'
parsed = list(urlparse.urlparse(testurl))
newurl = ["" for i in range(6)] # could urlparse another address instead
# Copy first 3 values from
# ['http', 'www.example.com', '//path', '', 'foo=bar', '']
for i in range(3):
    newurl[i] = parsed[i]
# Rest are blank
for i in range(4, 6):
    newurl[i] = ''
print urlparse.urlunparse(newurl)
# http://www.example.com//path
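The two snippets above can be folded into a single helper. This is only a sketch under a few assumptions: it targets Python 3 (`urllib.parse` instead of the Python 2 `urlparse` module), and the name `clean_url` is made up for illustration.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def clean_url(url):
    """Collapse repeated slashes in the path and drop query and fragment."""
    parts = urlsplit(url)
    # Only the path is touched, so the "//" after the scheme is preserved.
    path = re.sub(r"/{2,}", "/", parts.path)
    # Empty strings for query and fragment remove them from the result.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


print(clean_url("http://www.example.com//path?foo=bar"))
```

Because the URL is rebuilt from its components rather than re-joined, the `//path` problem from the question never arises.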
Answer 1 (score: 4)
If you only want the URL without the query part, I would skip the urlparse module and just do:
testUrl.rsplit('?')
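The URL will be at index 0 of the returned list and the query at index 1.
A URL normally contains only one '?', so this should work for most URLs. Strictly speaking, RFC 3986 allows further '?' characters inside the query and fragment, so testUrl.split('?', 1) is the safer form.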
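A plain-string version that also drops a fragment might look like the sketch below. The function name is hypothetical, and it assumes the input is a single URL string with no surrounding whitespace.

```python
def strip_query_and_fragment(url):
    """Return the URL up to, but not including, the first '#' or '?'."""
    # Cut the fragment first: a '?' may legally appear inside it.
    without_fragment = url.split("#", 1)[0]
    # maxsplit=1 keeps this correct even if the query contains another '?'.
    return without_fragment.split("?", 1)[0]


print(strip_query_and_fragment("http://www.example.com//path?foo=bar"))
```

Note that this leaves the double slash in the path untouched; it only removes the query and fragment.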
Answer 2 (score: 2)
If url is an absolute URL (that is, starting with // or scheme://), the url's host name and/or scheme will be present in the result. For example:
>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
...         '//www.python.org/%7Eguido')
'http://www.python.org/%7Eguido'
If you do not want that behavior, preprocess the url with urlsplit() and urlunsplit(), removing possible scheme and netloc parts.
So you could do:
urlparse.urljoin(testUrl,
                 urlparse.urlparse(testUrl).path.replace('//', '/'))
Output: 'http://www.example.com/path'
Answer 3 (score: 0)
Isn't this a solution?
urlparse.urlparse(testUrl).path.replace('//', '/')
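One caveat with a single `replace('//', '/')` pass is that it only halves longer runs of slashes. The sketch below compares it with a regex that collapses a run of any length; the sample path is made up for illustration.

```python
import re

path = "/x///c//v////path"

# str.replace scans once, left to right, so "///" becomes "//"
# and "////" becomes "//".
single_pass = path.replace("//", "/")
print(single_pass)

# A regex that matches two or more slashes collapses each run in one go.
collapsed = re.sub(r"/{2,}", "/", path)
print(collapsed)
```

For the exact URL in the question (one double slash) both approaches give the same result; the difference only shows up with messier input.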
Answer 4 (score: 0)
Try this:
def http_normalize_slashes(url):
    url = str(url)
    segments = url.split('/')
    correct_segments = []
    for segment in segments:
        if segment != '':
            correct_segments.append(segment)
    first_segment = str(correct_segments[0])
    if first_segment.find('http') == -1:
        correct_segments = ['http:'] + correct_segments
    correct_segments[0] = correct_segments[0] + '/'
    normalized_url = '/'.join(correct_segments)
    return normalized_url
Example URLs:
print(http_normalize_slashes('http://www.example.com//path?foo=bar'))
print(http_normalize_slashes('http:/www.example.com//path?foo=bar'))
print(http_normalize_slashes('www.example.com//x///c//v///path?foo=bar'))
print(http_normalize_slashes('http://////www.example.com//x///c//v///path?foo=bar'))
These will return:
http://www.example.com/path?foo=bar
http://www.example.com/path?foo=bar
http://www.example.com/x/c/v/path?foo=bar
http://www.example.com/x/c/v/path?foo=bar
Hope it helps. :)
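Answer 5 (score: 0)
In my case, where I was trying to fix double slashes in the path without touching the initial double slash of the http:// part, this answer seemed to give the best result.
The code is as follows: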
from urlparse import urljoin
from functools import reduce
def slash_join(*args):
    return reduce(urljoin, args).rstrip("/")
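A short usage sketch may help show what slash_join does and does not do. It assumes Python 3 (`urllib.parse` in place of the Python 2 `urlparse` module), and the example URLs are made up.

```python
from functools import reduce
from urllib.parse import urljoin


def slash_join(*args):
    # Fold urljoin over the pieces, then drop any trailing slash.
    return reduce(urljoin, args).rstrip("/")


# Joining a base and a rooted path avoids producing "//" at the seam.
print(slash_join("http://www.example.com/", "/path/"))

# urljoin semantics still apply: without a trailing slash on the base,
# the last path segment is replaced rather than extended.
print(slash_join("http://www.example.com/a", "b"))

# A piece that itself starts with "//" is still read as a hostname.
print(slash_join("http://www.example.com/", "//path"))
```

In other words, this helps when assembling a URL from parts, but it does not repair a double slash that is already inside one of the pieces.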
Answer 6 (score: 0)
I adapted @yunhasnawa's answer to my needs. Here is part of it:
import urllib2
from urlparse import urlparse, urlunparse
def sanitize_url(url):
    url_parsed = urlparse(url)
    return urlunparse((url_parsed.scheme, url_parsed.netloc,
                       avoid_double_slash(url_parsed.path), '', '', ''))

def avoid_double_slash(path):
    parts = path.split('/')
    not_empties = [part for part in parts if part]
    return '/'.join(not_empties)
>>> sanitize_url('https://hostname.doma.in:8443/complex-path////next//')
'https://hostname.doma.in:8443/complex-path/next'
Answer 7 (score: 0)
This may not be entirely safe, but you can use this regex:
It replaces "[non-colon] followed by two slashes" with "[non-colon] followed by a single slash". The [non-colon] part is there to preserve http:// or https://.
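The description above could be realized roughly as follows. This is a sketch only: the pattern and the function name `collapse_slashes` are assumptions written to match the description, not necessarily the regex the answer originally used, and it is written for Python 3.

```python
import re


def collapse_slashes(url):
    # Assumed pattern: one non-colon character followed by two or more
    # slashes is replaced by that character and a single slash.
    # "://" survives because the slashes there follow a colon.
    return re.sub(r"([^:])/{2,}", r"\1/", url)


print(collapse_slashes("http://www.example.com//path?foo=bar"))
```

As the answer warns, this is not entirely safe: the regex runs over the whole string, so a double slash inside the query or fragment is collapsed as well.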