How do I get a specific section of the path from a URL? For example, I want a function that operates on this:
http://www.mydomain.com/hithere?image=2934
and returns "hithere",
or operates on this:
http://www.mydomain.com/hithere/something/else
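and returns the same thing ("hithere").
I know this will probably use urllib or urllib2, but I can't work out from the documentation how to get only one section of the path.
Answer 0 (score: 36)
Use urlparse to extract the path component of the URL (the session below is Python 2; in Python 3 the same function lives in urllib.parse):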
>>> import urlparse
>>> path = urlparse.urlparse('http://www.example.com/hithere/something/else').path
>>> path
'/hithere/something/else'
Split the path into components with os.path.split:
>>> import os.path
>>> os.path.split(path)
('/hithere/something', 'else')
The dirname and basename functions give you the two halves of that split; you could, for example, call dirname in a while loop:
>>> while os.path.dirname(path) != '/':
...     path = os.path.dirname(path)
...
>>> path
'/hithere'
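For Python 3, the same idea can be wrapped in a small function. This is only a sketch: the name first_path_segment is made up for illustration, and it splits on "/" directly instead of looping over dirname.

```python
from urllib.parse import urlparse


def first_path_segment(url):
    """Return the first non-empty path segment of url, or '' if there is none."""
    path = urlparse(url).path  # the query string (?image=2934) is not part of .path
    segments = [segment for segment in path.split('/') if segment]
    return segments[0] if segments else ''


print(first_path_segment('http://www.mydomain.com/hithere?image=2934'))      # hithere
print(first_path_segment('http://www.mydomain.com/hithere/something/else'))  # hithere
```

Unlike the while loop above, this version also copes with a URL that has no path at all.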
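Answer 1 (score: 17)
The best option when working with the path component of a URL is the posixpath module. It has the same interface as os.path and always operates on POSIX paths, whether it is used on a POSIX or a Windows NT based platform.
Sample code: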
#!/usr/bin/env python3

import urllib.parse
import sys
import posixpath
import ntpath
import json

def path_parse( path_string, *, normalize = True, module = posixpath ):
    result = []
    if normalize:
        tmp = module.normpath( path_string )
    else:
        tmp = path_string
    # Peel components off the end until only the root "/" is left.
    # Assumes an absolute path: the loop does not terminate for
    # relative or empty paths.
    while tmp != "/":
        ( tmp, item ) = module.split( tmp )
        result.insert( 0, item )
    return result

def dump_array( array ):
    string = "[ "
    for index, item in enumerate( array ):
        if index > 0:
            string += ", "
        string += "\"{}\"".format( item )
    string += " ]"
    return string

def test_url( url, *, normalize = True, module = posixpath ):
    url_parsed = urllib.parse.urlparse( url )
    path_parsed = path_parse( urllib.parse.unquote( url_parsed.path ),
        normalize=normalize, module=module )
    sys.stdout.write( "{}\n --[n={},m={}]-->\n {}\n".format(
        url, normalize, module.__name__, dump_array( path_parsed ) ) )

test_url( "http://eg.com/hithere/something/else" )
test_url( "http://eg.com/hithere/something/else/" )
test_url( "http://eg.com/hithere/something/else/", normalize = False )
test_url( "http://eg.com/hithere/../else" )
test_url( "http://eg.com/hithere/../else", normalize = False )
test_url( "http://eg.com/hithere/../../else" )
test_url( "http://eg.com/hithere/../../else", normalize = False )
test_url( "http://eg.com/hithere/something/./else" )
test_url( "http://eg.com/hithere/something/./else", normalize = False )
test_url( "http://eg.com/hithere/something/./else/./" )
test_url( "http://eg.com/hithere/something/./else/./", normalize = False )
test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False )
test_url( "http://eg.com/see%5C/if%5C/this%5C/works", normalize = False,
module = ntpath )
Code output:
http://eg.com/hithere/something/else
--[n=True,m=posixpath]-->
[ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
--[n=True,m=posixpath]-->
[ "hithere", "something", "else" ]
http://eg.com/hithere/something/else/
--[n=False,m=posixpath]-->
[ "hithere", "something", "else", "" ]
http://eg.com/hithere/../else
--[n=True,m=posixpath]-->
[ "else" ]
http://eg.com/hithere/../else
--[n=False,m=posixpath]-->
[ "hithere", "..", "else" ]
http://eg.com/hithere/../../else
--[n=True,m=posixpath]-->
[ "else" ]
http://eg.com/hithere/../../else
--[n=False,m=posixpath]-->
[ "hithere", "..", "..", "else" ]
http://eg.com/hithere/something/./else
--[n=True,m=posixpath]-->
[ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else
--[n=False,m=posixpath]-->
[ "hithere", "something", ".", "else" ]
http://eg.com/hithere/something/./else/./
--[n=True,m=posixpath]-->
[ "hithere", "something", "else" ]
http://eg.com/hithere/something/./else/./
--[n=False,m=posixpath]-->
[ "hithere", "something", ".", "else", ".", "" ]
http://eg.com/see%5C/if%5C/this%5C/works
--[n=False,m=posixpath]-->
[ "see\", "if\", "this\", "works" ]
http://eg.com/see%5C/if%5C/this%5C/works
--[n=False,m=ntpath]-->
[ "see", "if", "this", "works" ]
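Notes:
On Windows NT based platforms, os.path is ntpath.
On POSIX based platforms, os.path is posixpath.
ntpath does not handle backslashes (\) correctly (see the last two cases in the code/output), which is why posixpath is recommended.
Remember to decode the path with urllib.parse.unquote.
Consider normalizing it with posixpath.normpath.
Be aware of the semantics of multiple adjacent path separators (/) in a URL. posixpath, however, collapses multiple adjacent path separators (that is, it treats ///, // and / the same).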
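The notes above can be checked with a few direct calls. This is a minimal sketch; the sample path strings are made up for illustration and do not come from the test run above.

```python
import ntpath
import posixpath

# A URL path after unquoting "%5C" into a literal backslash.
path = "/see\\/if"

# posixpath treats the backslash as an ordinary character in the name.
print(posixpath.split(path))  # ('/see\\', 'if')

# ntpath treats the backslash as a separator and strips it from the head.
print(ntpath.split(path))     # ('/see', 'if')

# normpath drops "." components, repeated slashes and the trailing slash.
print(posixpath.normpath("/hithere///something/./else/"))  # /hithere/something/else
```

Both modules can be imported on any platform, so the comparison runs the same way on Windows and on POSIX systems.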
Answer 2 (score: 8)
A Python 3.4+ solution:
from urllib.parse import unquote, urlparse
from pathlib import PurePosixPath
url = 'http://www.example.com/hithere/something/else'
PurePosixPath(
    unquote(
        urlparse(
            url
        ).path
    )
).parts[1]
# returns 'hithere' (the same for the URL with parameters)
# parts holds ('/', 'hithere', 'something', 'else')
#               0    1          2            3
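Indexing parts[1] raises IndexError when the URL has no path segment. One possible guard is sketched below; the helper name first_part is hypothetical and not part of pathlib.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def first_part(url):
    """Return the first path segment of url, or None if the path has none."""
    parts = PurePosixPath(unquote(urlparse(url).path)).parts
    # parts[0] is the root '/', so the first real segment is parts[1].
    return parts[1] if len(parts) > 1 else None


print(first_part('http://www.mydomain.com/hithere?image=2934'))  # hithere
print(first_part('http://www.mydomain.com/'))                    # None
```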
Answer 3 (score: 3)
Note that in Python 3 the import has changed to from urllib.parse import urlparse; see the documentation. Here is an example:
>>> from urllib.parse import urlparse
>>> url = 's3://bucket.test/my/file/directory'
>>> p = urlparse(url)
>>> p
ParseResult(scheme='s3', netloc='bucket.test', path='/my/file/directory', params='', query='', fragment='')
>>> p.scheme
's3'
>>> p.netloc
'bucket.test'
>>> p.path
'/my/file/directory'
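Applied to the first URL in the question, the same attributes show why the query string does not get in the way: it is stored separately from the path. A short sketch follows; parse_qs is used only to illustrate the query part.

```python
from urllib.parse import parse_qs, urlparse

p = urlparse('http://www.mydomain.com/hithere?image=2934')

print(p.path)             # /hithere
print(p.query)            # image=2934
print(parse_qs(p.query))  # {'image': ['2934']}
```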
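Answer 4 (score: 0)
A combination of urlparse and os.path.split will do the trick. The following script stores all the sections of a URL's path in a list, in reverse order.
# Python 2 module; in Python 3, import urllib.parse and call urllib.parse.urlparse
import os.path, urlparse

def generate_sections_of_url(url):
    path = urlparse.urlparse(url).path
    sections = []
    while path != '/':
        temp = os.path.split(path)
        path = temp[0]
        sections.append(temp[1])
    return sections

For the second URL in the question, this returns: ["else", "something", "hithere"]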
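A Python 3 variant of the same loop might look like the sketch below. It swaps in posixpath for os.path (an assumption borrowed from Answer 1), and it stops as soon as split makes no further progress, so an empty path cannot loop forever. The name sections_of_url is made up for illustration.

```python
import posixpath
from urllib.parse import urlparse


def sections_of_url(url):
    """Return the path sections of url, last section first."""
    path = urlparse(url).path
    sections = []
    while True:
        head, tail = posixpath.split(path)
        if head == path:  # no progress: reached the root or an empty path
            break
        sections.append(tail)
        path = head
    return sections


sections = sections_of_url('http://www.mydomain.com/hithere/something/else')
print(sections)                  # ['else', 'something', 'hithere']
print(list(reversed(sections)))  # ['hithere', 'something', 'else']
```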
Answer 5 (score: 0)
Here is an example using urlparse and rpartition.
try:
    # Python 3.x
    from urllib.parse import urlparse
except ImportError:
    # Python 2.x
    from urlparse import urlparse

def printPathTokens(full_url):
    print('printPathTokens() called: %s' % full_url)
    p_full = urlparse(full_url).path
    print(' . p_full url: %s' % p_full)
    # Split the path using the rpartition method of str.
    # rpartition returns a tuple containing the part before the separator,
    # the separator itself, and the part after the separator.
    (rp_left, rp_match, rp_right) = p_full.rpartition('/')
    if rp_match == '':  # rp_match is the separator if found, '' otherwise
        print(' . No slashes found in path')
    else:
        print(' . path to last resource: %s' % rp_left)
        if rp_right == '':  # Ended with a slash
            print(' . last resource: (none)')
        else:
            print(' . last resource: %s' % (rp_right))

printPathTokens('http://www.example.com/temp/something/happen/index.html')
# Output:
# printPathTokens() called: http://www.example.com/temp/something/happen/index.html
# . p_full url: /temp/something/happen/index.html
# . path to last resource: /temp/something/happen
# . last resource: index.html
printPathTokens('http://www.example.com/temp/something/happen/')
# Output:
# printPathTokens() called: http://www.example.com/temp/something/happen/
# . p_full url: /temp/something/happen/
# . path to last resource: /temp/something/happen
# . last resource: (none)
printPathTokens('http://www.example.com/temp/something/happen')
# Output:
# printPathTokens() called: http://www.example.com/temp/something/happen
# . p_full url: /temp/something/happen
# . path to last resource: /temp/something
# . last resource: happen
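If the pieces are needed as values and not as printed text, the same rpartition call can be returned directly. This is a sketch for Python 3 only; the function name split_last_resource is hypothetical.

```python
from urllib.parse import urlparse


def split_last_resource(url):
    """Return (parent_path, last_resource); last_resource is '' after a trailing slash."""
    parent, _separator, last = urlparse(url).path.rpartition('/')
    return parent, last


print(split_last_resource('http://www.example.com/temp/something/happen/index.html'))
# ('/temp/something/happen', 'index.html')
print(split_last_resource('http://www.example.com/temp/something/happen/'))
# ('/temp/something/happen', '')
```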