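URL parsing in Python - normalizing double slashes in paths

Time: 2012-01-19 12:21:30

Tags: python urlparse

I am working on an application that needs to parse URLs (mostly HTTP URLs) found in HTML pages. I have no control over the input, and some of it is, as expected, a bit messy.

One problem I run into frequently is that urlparse is very strict (possibly even buggy?) when parsing and joining URLs that have a double slash in the path part, for example: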

testUrl = 'http://www.example.com//path?foo=bar'
urlparse.urljoin(testUrl, 
                 urlparse.urlparse(testUrl).path)

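Instead of the expected result http://www.example.com//path (or, even better, one with a normalized single slash), I end up with http://path.

By the way, the reason I run this kind of code is that it is the only way I have found so far to strip the query/fragment part from a URL. Maybe there is a better way, but I couldn't find one.

Can anyone recommend a way to avoid this, or should I just normalize the path myself with a (relatively simple, I know) regex?

8 answers:

Answer 0: (score: 6)

The path on its own (//path) is not valid. Passed to urljoin by itself, the leading // is read as the start of an authority, so "path" is interpreted as a hostname.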

http://tools.ietf.org/html/rfc3986.html#section-3.3

  

If a URI does not contain an authority component, then the path cannot begin with two slash characters ("//").
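To see the rule at work, here is a minimal sketch. It assumes Python 3, where the urlparse module used in this thread is available as urllib.parse; the variable names are illustrative only.

```python
from urllib.parse import urljoin, urlsplit

test_url = 'http://www.example.com//path?foo=bar'

# Parsed as part of a full URL, the double slash stays inside the path.
path = urlsplit(test_url).path

# Parsed on its own, the same string is a network-path reference:
# the text after '//' is read as the authority (host), not as a path.
alone = urlsplit(path)

# urljoin therefore keeps only the scheme of the base URL.
joined = urljoin(test_url, path)

print(repr(path))
print(repr(alone.netloc), repr(alone.path))
print(joined)
```

So the surprising result comes from the second argument being reinterpreted, not from the original URL being parsed incorrectly.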

I don't particularly like either of these solutions, but they work:

import re
import urlparse

testurl = 'http://www.example.com//path?foo=bar'

parsed = list(urlparse.urlparse(testurl))
parsed[2] = re.sub("/{2,}", "/", parsed[2]) # replace two or more / with one
cleaned = urlparse.urlunparse(parsed)

print cleaned
# http://www.example.com/path?foo=bar

print urlparse.urljoin(
    testurl, 
    urlparse.urlparse(cleaned).path)

# http://www.example.com/path

Depending on what you are doing, you could do the join manually:

import re
import urlparse

testurl = 'http://www.example.com//path?foo=bar'
parsed = list(urlparse.urlparse(testurl))

newurl = ["" for i in range(6)] # could urlparse another address instead

# Copy first 3 values from
# ['http', 'www.example.com', '//path', '', 'foo=bar', '']
for i in range(3):
    newurl[i] = parsed[i]

# The rest (params, query, fragment) stay blank
for i in range(3, 6):
    newurl[i] = ''

print urlparse.urlunparse(newurl)
# http://www.example.com//path
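The two variants above can be folded into one helper. The following is only a sketch, assuming Python 3 (urllib.parse in place of urlparse); the function name strip_and_normalize is made up for illustration.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def strip_and_normalize(url):
    """Drop query and fragment, and collapse repeated slashes in the path."""
    parts = urlsplit(url)
    # Only the path is touched, so the '//' after the scheme is never at risk.
    path = re.sub(r'/{2,}', '/', parts.path)
    # Rebuild from scheme, netloc and path; query and fragment are left empty.
    return urlunsplit((parts.scheme, parts.netloc, path, '', ''))


print(strip_and_normalize('http://www.example.com//path?foo=bar#top'))
```

Because the result is rebuilt from its components, no urljoin call is needed and the //path ambiguity never arises.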

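Answer 1: (score: 4)

If you only want the URL without the query part, I would skip the urlparse module and just do:

testUrl.split('?', 1)

The URL will be at index 0 of the returned list and the query at index 1.

The first '?' in a URL starts the query, so splitting once on it should work for all URLs, even when the query itself contains another '?'. Note that a #fragment on a URL without a query is left in place.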
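As a hedged illustration of this string-only approach, the sketch below cuts at the first '#' and then at the first '?', so the fragment is removed as well. The function name strip_query_and_fragment is hypothetical.

```python
def strip_query_and_fragment(url):
    """Return url without its '?query' and '#fragment' parts."""
    # The fragment starts at the first '#'; drop it first.
    without_fragment = url.partition('#')[0]
    # The query starts at the first '?' of what remains.
    return without_fragment.partition('?')[0]


print(strip_query_and_fragment('http://www.example.com//path?foo=bar'))
```

This leaves the double slash in the path untouched; it only answers the "strip the query/fragment" part of the question.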

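Answer 2: (score: 2)

The official urlparse docs mention:

If url is an absolute URL (that is, starting with // or scheme://), the url's host name and/or scheme will be present in the result. For example: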

>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
...         '//www.python.org/%7Eguido')
'http://www.python.org/%7Eguido'
  

If you do not want that behavior, preprocess the url with urlsplit() and urlunsplit(), removing possible scheme and netloc parts.

So you could do:

urlparse.urljoin(testUrl,
             urlparse.urlparse(testUrl).path.replace('//','/'))

Output: 'http://www.example.com/path'
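One caveat worth checking: str.replace makes a single left-to-right pass, so a run of three slashes still leaves a double slash behind. The sketch below assumes Python 3 (urllib.parse) and compares it with the re.sub pattern from answer 0; the variable names are illustrative.

```python
import re
from urllib.parse import urljoin, urlsplit

test_url = 'http://www.example.com///path?foo=bar'  # three slashes this time
path = urlsplit(test_url).path

# A single replace pass turns '///' into '//', which urljoin reads as a host.
replaced = path.replace('//', '/')

# Collapsing any run of slashes avoids that.
collapsed = re.sub(r'/{2,}', '/', path)

print(urljoin(test_url, replaced))
print(urljoin(test_url, collapsed))
```

For input that is known to contain at most two consecutive slashes, the plain replace is enough.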

Answer 3: (score: 0)

Isn't this a solution?

urlparse.urlparse(testUrl).path.replace('//', '/')

Answer 4: (score: 0)

Try this:

def http_normalize_slashes(url):
    url = str(url)
    segments = url.split('/')
    correct_segments = []
    for segment in segments:
        if segment != '':
            correct_segments.append(segment)
    first_segment = str(correct_segments[0])
    if first_segment.find('http') == -1:
        correct_segments = ['http:'] + correct_segments
    correct_segments[0] = correct_segments[0] + '/'
    normalized_url = '/'.join(correct_segments)
    return normalized_url

Example URLs:

print(http_normalize_slashes('http://www.example.com//path?foo=bar'))
print(http_normalize_slashes('http:/www.example.com//path?foo=bar'))
print(http_normalize_slashes('www.example.com//x///c//v///path?foo=bar'))
print(http_normalize_slashes('http://////www.example.com//x///c//v///path?foo=bar'))

These will return:

http://www.example.com/path?foo=bar
http://www.example.com/path?foo=bar
http://www.example.com/x/c/v/path?foo=bar
http://www.example.com/x/c/v/path?foo=bar

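Hope this helps :)

Answer 5: (score: 0)

In my case, where I was trying to fix double slashes in the path without touching the initial double slash of the http:// part, this answer seems to give the best results.

The code is as follows: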

from urlparse import urljoin
from functools import reduce


def slash_join(*args):
    return reduce(urljoin, args).rstrip("/")
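A short usage sketch for slash_join, assuming Python 3, where urljoin is imported from urllib.parse. The function follows urljoin's rules, so intermediate pieces need a trailing slash; otherwise the last segment is replaced rather than extended.

```python
from functools import reduce
from urllib.parse import urljoin


def slash_join(*args):
    # Same function as above, with the Python 3 import location.
    return reduce(urljoin, args).rstrip("/")


# Intermediate pieces ending in '/' are treated as directories and kept.
kept = slash_join('http://www.example.com/', 'a/', 'b/')

# Without the trailing slash, urljoin replaces the last segment instead.
replaced = slash_join('http://www.example.com/a', 'b')

print(kept)
print(replaced)
```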

Answer 6: (score: 0)

I adapted @yunhasnawa's answer to my needs. Here it is:

from urlparse import urlparse, urlunparse

def sanitize_url(url):
    url_parsed = urlparse(url)
    return urlunparse((url_parsed.scheme, url_parsed.netloc, avoid_double_slash(url_parsed.path), '', '', ''))

def avoid_double_slash(path):
    parts = path.split('/')
    not_empties = [part for part in parts if part]
    return '/'.join(not_empties)


>>> sanitize_url('https://hostname.doma.in:8443/complex-path////next//')
'https://hostname.doma.in:8443/complex-path/next'

Answer 7: (score: 0)

This may not be completely safe, but you could use a regular expression for this:

It replaces "[non-colon] followed by two slashes" with "[non-colon] followed by one slash". The [non-colon] part is there to preserve http:// or https://.
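A pattern matching that description could look like the sketch below. This is an assumption reconstructed from the description, not necessarily the expression the answer's author used, and the function name collapse_double_slashes is hypothetical. It also collapses runs longer than two slashes, which goes slightly beyond what is described.

```python
import re


def collapse_double_slashes(url):
    # ([^:]) captures any character other than ':'; the run of slashes that
    # follows it is replaced by a single slash, so '://' is left alone.
    return re.sub(r'([^:])/{2,}', r'\1/', url)


print(collapse_double_slashes('http://www.example.com//path?foo=bar'))
```

Since the pattern runs over the whole string, it would also rewrite a double slash inside the query or fragment, which is one reason it may not be completely safe.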