Question

I am using urllib.request.urlopen(URL) to look up the size of various files on different servers. The problem is that I need to authenticate myself, which I do as follows.

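import urllib.request

url = "https://abc123-abca93.xxx.xxxx.se/other_parts_of_url/file.tar"
top_level_url = "https://abc123-abca93.xxx.xxxx.se/"
# password manager that applies the credentials to any realm on this server
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, top_level_url, 'username', password.get())
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
# create "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)
# install it so that plain urllib.request.urlopen() calls use it
urllib.request.install_opener(opener)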

With this in place I can now access the file when I run

filesize = urllib.request.urlopen(url).headers._headers[8][1]

The problem is that the URL is different for every file, so I would like to use a regular expression to find the first part of the URL, i.e.

"https://"+more_characters+".se"+possibly_port_number+"/"

I figured I could use re.match, but I am not sure how to write the right logic for this case. For example, is it possible to do something like

match = re.match("https://" + any amount of characters +"/", url)

Answer 1

You can use urllib's parsing functions:

from urllib.parse import urlparse

url = "https://abc123-abca93.xxx.xxxx.se/other_parts_of_url/file.tar"

parse_result = urlparse(url)

# netloc is only the host[:port] part, here 'abc123-abca93.xxx.xxxx.se'
top_level_url = parse_result.netloc
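
To make it clearer what urlparse returns, here is a small sketch that prints nothing but names each component. The port 8443 is not in the question; it is added here only as an assumption, to show where a port number would end up.

```python
from urllib.parse import urlparse

# Hypothetical variant of the question's URL with an explicit port (8443 is assumed)
url = "https://abc123-abca93.xxx.xxxx.se:8443/other_parts_of_url/file.tar"

parts = urlparse(url)

scheme = parts.scheme      # protocol, without '://'
netloc = parts.netloc      # host plus optional ':port'
hostname = parts.hostname  # host only
port = parts.port          # port as an int, or None if the URL has none
path = parts.path          # everything after the host, starting with '/'
```

Note that netloc carries neither the scheme nor the trailing slash, so it is not yet the "first part" the question asks for.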

Answer 2

A possible regular expression: https://regex101.com/r/GyEFx2/1

Then use:

import re

# pattern: the regular expression from the link above
match = re.match(pattern, url)
if match:
    first_part = match.group(0)
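
The pattern behind the link is not reproduced in this answer, so the following is only a sketch with an assumed pattern, not necessarily the linked one. It matches "https://", then one or more characters that are not a slash (host and optional port), then the first slash. The names TOP_LEVEL_RE and first_part are made up for illustration.

```python
import re

# Assumed pattern: scheme, then everything up to and including the first '/'
TOP_LEVEL_RE = re.compile(r"https://[^/]+/")

def first_part(url):
    """Return the 'https://host[:port]/' prefix of url, or None if it does not match."""
    match = TOP_LEVEL_RE.match(url)
    return match.group(0) if match else None

result = first_part("https://abc123-abca93.xxx.xxxx.se/other_parts_of_url/file.tar")
```

Because [^/]+ stops at the first slash, the match cannot run into the path the way a greedy ".*" would.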

Answer 3

This is a common problem; use urlparse for it (Python 3 version):

from urllib.parse import urlparse
o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
toplevel = o.scheme + "://" + o.netloc
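
Tying this back to the question, one way this might be used is to derive the top-level URL for each file and register it with the password manager. This is a rough sketch: the helper names top_level_url and register_credentials, and the password "secret", are placeholders that do not appear in the question.

```python
import urllib.request
from urllib.parse import urlparse

def top_level_url(url):
    """Return 'scheme://host[:port]/' for the given URL."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}/"

def register_credentials(password_mgr, url, username, password):
    """Register the credentials for the server that hosts url; return the root used."""
    root = top_level_url(url)
    password_mgr.add_password(None, root, username, password)
    return root

url = "https://abc123-abca93.xxx.xxxx.se/other_parts_of_url/file.tar"
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# "secret" is a placeholder; the question reads the real value from password.get()
root = register_credentials(password_mgr, url, "username", "secret")
```

The password manager can then be passed to HTTPBasicAuthHandler exactly as in the question.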

Answer 4

You can also use plain old str.split():

Python 3.7.2 (default, Mar 21 2019, 10:05:02) 
[GCC 9.0.1 20190227 (Red Hat 9.0.1-0.8)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 'https://abc123-abca93.xxx.xxxx.se/other_parts_/file.tar'.split('/')
['https:', '', 'abc123-abca93.xxx.xxxx.se', 'other_parts_', 'file.tar']
>>>
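
The split result above still has to be put back together. A minimal sketch, assuming the URL always starts with "scheme://": limit the split to three slashes so the path stays in one piece, then rejoin the first three fields. The function name first_part_by_split is made up for illustration.

```python
def first_part_by_split(url):
    """Rebuild 'scheme://host[:port]/' from the first three '/'-separated fields."""
    # maxsplit=3 gives ['https:', '', 'host[:port]', 'rest/of/path']
    fields = url.split('/', 3)
    return '/'.join(fields[:3]) + '/'

result = first_part_by_split('https://abc123-abca93.xxx.xxxx.se/other_parts_/file.tar')
```

Unlike urlparse, this does no validation, so it is only appropriate when the input is known to be a well-formed absolute URL.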

How can I use re.match to find the first part of a URL?
