Question

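I'm working on an application that needs to parse URLs (mostly HTTP URLs) found in HTML pages. I have no control over the input, and some of it is, as expected, a bit messy.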

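One problem I keep running into is that urlparse is very strict (possibly even buggy?) when parsing and joining URLs that have a double slash in the path part, for example: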

testUrl = 'http://www.example.com//path?foo=bar'
urlparse.urljoin(testUrl, 
                 urlparse.urlparse(testUrl).path)

Instead of the expected result http://www.example.com//path (or even better, one with a normalized single slash), I end up with http://path.

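By the way, the reason I'm running this kind of code is that it's the only way I've found so far to strip the query/fragment part from a URL. Maybe there's a better way, but I couldn't find one.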

Can anyone recommend a way to avoid this, or should I just normalize the path myself with a (relatively simple, I know) regex?

Answer 1

On its own, the path (//path) isn't valid, which confuses the function: it gets interpreted as a hostname.

http://tools.ietf.org/html/rfc3986.html#section-3.3

If a URI does not contain an authority component, then the path cannot begin with two slash characters ("//").
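
To see this in practice, here is a small sketch. It assumes Python 3, where the urlparse module was renamed to urllib.parse, and the variable names are only illustrative. Parsed on its own, //path is read as a network-path reference, so "path" lands in the netloc slot and the actual path is empty:

```python
from urllib.parse import urljoin, urlsplit

test_url = 'http://www.example.com//path?foo=bar'

# Inside the full URL, the double slash is part of the path component.
full = urlsplit(test_url)
print(full.netloc, full.path)   # www.example.com //path

# Taken on its own, the same string parses as "//authority" with an empty path.
bare = urlsplit(full.path)
print(repr(bare.netloc), repr(bare.path))   # 'path' ''

# urljoin therefore treats it as a new host and only borrows the scheme.
joined = urljoin(test_url, full.path)
print(joined)   # http://path
```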

I don't particularly like either of these solutions, but they work:

import re
import urlparse

testurl = 'http://www.example.com//path?foo=bar'

parsed = list(urlparse.urlparse(testurl))
parsed[2] = re.sub("/{2,}", "/", parsed[2]) # replace two or more / with one
cleaned = urlparse.urlunparse(parsed)

print cleaned
# http://www.example.com/path?foo=bar

print urlparse.urljoin(
    testurl, 
    urlparse.urlparse(cleaned).path)

# http://www.example.com/path

Depending on what you're doing, you could do the join manually:

import re
import urlparse

testurl = 'http://www.example.com//path?foo=bar'
parsed = list(urlparse.urlparse(testurl))

newurl = ["" for i in range(6)] # could urlparse another address instead

# Copy first 3 values from
# ['http', 'www.example.com', '//path', '', 'foo=bar', '']
for i in range(3):
    newurl[i] = parsed[i]

# Rest are blank
for i in range(3, 6):
    newurl[i] = ''

print urlparse.urlunparse(newurl)
# http://www.example.com//path
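
The same manual join can be written more compactly. This is only a sketch, assuming Python 3 (urllib.parse in place of urlparse). It relies on the parse result being a named tuple, so its _replace method can blank out fields. strip_query_and_fragment is a made-up helper name:

```python
from urllib.parse import urlparse


def strip_query_and_fragment(url):
    """Return url without its params, query, and fragment parts."""
    parsed = urlparse(url)
    # ParseResult is a named tuple, so _replace returns a modified copy.
    return parsed._replace(params='', query='', fragment='').geturl()


print(strip_query_and_fragment('http://www.example.com//path?foo=bar#top'))
# http://www.example.com//path
```

Note that the double slash in the path is left alone here; only the trailing parts are dropped.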

Answer 2

If you only want the URL without the query part, I'd skip the urlparse module and just do:

testUrl.rsplit('?')

The URL will be at index 0 of the returned list, and the query at index 1.

A URL normally contains only one '?', so this should work for most URLs.
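
One caveat: RFC 3986 allows a literal '?' inside the query component, so splitting on every '?' can yield more than two pieces. A hedged sketch that cuts at the first '#' and then at the first '?' instead (strip_query is a hypothetical helper name, and the example URL is made up):

```python
def strip_query(url):
    """Cut a URL at the first '#' and then at the first '?'."""
    without_fragment = url.partition('#')[0]
    return without_fragment.partition('?')[0]


url = 'http://www.example.com/a?next=/b?c=d'
print(url.rsplit('?'))   # three pieces, because the query itself contains a '?'
print(strip_query(url))  # http://www.example.com/a
```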

Answer 3

The official urlparse docs mention:

If url is an absolute URL (that is, starting with // or scheme://), the url's host name and/or scheme will be present in the result. For example:

>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
...         '//www.python.org/%7Eguido')
'http://www.python.org/%7Eguido'

If you do not want that behavior, preprocess the url with urlsplit() and urlunsplit(), removing possible scheme and netloc parts.

So you can do:

urlparse.urljoin(testUrl,
             urlparse.urlparse(testUrl).path.replace('//','/'))

Output: 'http://www.example.com/path'

Answer 4

Isn't this a solution?

urlparse.urlparse(testUrl).path.replace('//', '/')
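
One thing to watch with replace('//', '/'): it makes a single pass, so runs of three or more slashes are shortened but not fully collapsed. A quick sketch comparing it with the /{2,} pattern used in Answer 1 (the sample path is made up):

```python
import re

path = '/a///b////c'

# str.replace makes a single left-to-right pass, so longer runs survive.
print(path.replace('//', '/'))      # /a//b//c

# A regex that matches the whole run collapses it in one go.
print(re.sub(r'/{2,}', '/', path))  # /a/b/c
```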

Answer 5

Try this:

def http_normalize_slashes(url):
    url = str(url)
    segments = url.split('/')
    correct_segments = []
    for segment in segments:
        if segment != '':
            correct_segments.append(segment)
    first_segment = str(correct_segments[0])
    if first_segment.find('http') == -1:
        correct_segments = ['http:'] + correct_segments
    correct_segments[0] = correct_segments[0] + '/'
    normalized_url = '/'.join(correct_segments)
    return normalized_url

Example URLs:

print(http_normalize_slashes('http://www.example.com//path?foo=bar'))
print(http_normalize_slashes('http:/www.example.com//path?foo=bar'))
print(http_normalize_slashes('www.example.com//x///c//v///path?foo=bar'))
print(http_normalize_slashes('http://////www.example.com//x///c//v///path?foo=bar'))

This will return:

http://www.example.com/path?foo=bar
http://www.example.com/path?foo=bar
http://www.example.com/x/c/v/path?foo=bar
http://www.example.com/x/c/v/path?foo=bar

Hope it helps.. :)

Answer 6

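In my case, where I was trying to fix double slashes in the path without touching the initial double slash of the http:// bit, this answer seemed to give the best results.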

The code is as follows:

from urlparse import urljoin
from functools import reduce


def slash_join(*args):
    return reduce(urljoin, args).rstrip("/")
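
A usage sketch, assuming Python 3 (where urljoin lives in urllib.parse) and made-up example URLs. Whether each piece ends in a slash matters, because urljoin replaces the last segment of the base otherwise:

```python
from functools import reduce
from urllib.parse import urljoin  # Python 3 home of urlparse.urljoin


def slash_join(*args):
    return reduce(urljoin, args).rstrip('/')


# A trailing slash on each piece makes urljoin append rather than replace.
print(slash_join('http://www.example.com/api/', 'v1/', 'users/'))
# http://www.example.com/api/v1/users

# Without it, the last segment of the base is replaced.
print(slash_join('http://www.example.com/api', 'v1'))
# http://www.example.com/v1
```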

Answer 7

I've adopted @yunhasnawa's answer for my needs. Here is the relevant part:

import urllib2
from urlparse import urlparse, urlunparse

def sanitize_url(url):
    url_parsed = urlparse(url)  
    return urlunparse((url_parsed.scheme, url_parsed.netloc, avoid_double_slash(url_parsed.path), '', '', ''))

def avoid_double_slash(path):
  parts = path.split('/')
  not_empties = [part for part in parts if part]
  return '/'.join(not_empties)


>>> sanitize_url('https://hostname.doma.in:8443/complex-path////next//')
'https://hostname.doma.in:8443/complex-path/next'

Answer 8

This may not be completely safe, but you could use this regular expression:

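It replaces "[non-colon] followed by two slashes" with "[non-colon] followed by a single slash". The [non-colon] is there to preserve the http:// or https:// part.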
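
A minimal sketch of a substitution that fits this description. The exact pattern is an assumption, generalized here to runs of two or more slashes, and collapse_slashes is a hypothetical name:

```python
import re


def collapse_slashes(url):
    """Collapse runs of slashes that are preceded by a non-colon character."""
    # ([^:]) captures the character before the run so it can be put back.
    return re.sub(r'([^:])/{2,}', r'\1/', url)


print(collapse_slashes('http://www.example.com//path?foo=bar'))
# http://www.example.com/path?foo=bar
```

It also rewrites double slashes inside the query string (for example ?next=//host), which may be one reason it is not completely safe.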

URL parsing in Python - normalizing double slashes in paths

8 answers: