Question

Given a URL such as:

http://web.archive.org/web/20010312011552/www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html

Is there a way (using a library, a package, or vanilla Python) to retrieve the domain "www.feralhouse.com"?

I thought of simply splitting on "www", splitting the item at index 1 on "com", and then joining the first piece back together, like this:

url = "http://web.archive.org/web/20010312011552/www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html"
url1=url.split("www")
url2=url1[1].split("com")
desired_output = "www"+url2[0]+"com"
print(desired_output)
#www.feralhouse.com

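But this approach has exceptions (sites without www, which I assume rely on the browser to add it automatically). I would prefer a less "hacky" approach if possible. Thanks in advance!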

Note: I don't want a solution for this specific URL only; I want one that works for all possible archive URLs.

Edit: another example URL

http://web.archive.org/web/20000614170338/http://www.clonejesus.com/

Answer 1

Two approaches, one using the split method and one using the re module:

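import re

s = 'http://web.archive.org/web/20010312011552/www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html'

# Split on the first five slashes; the last piece is the archived URL.
print(s.split('/', 5)[-1])

# Capture everything that follows the 14-digit timestamp and its slash.
print(re.findall(r'\d{14}/(.*)', s)[0])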

Prints:

www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html
www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html
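
Both snippets return the archived URL with its path, not just the domain the question asks for. A minimal sketch of one way to go the rest of the way, assuming every archive URL contains "/web/" followed by a 14-digit timestamp (optionally with a short modifier such as "id_"): strip the prefix with re, then let urllib.parse extract the host. The function name archived_domain and both pattern names are made up for this example, and the "http://" default for scheme-less URLs is an assumption.

```python
import re
from urllib.parse import urlsplit

# Assumption: snapshot URLs look like ".../web/<14 digits>[modifier]/<original URL>".
SNAPSHOT_RE = re.compile(r"/web/\d{14}[a-z_]*/(.+)", re.IGNORECASE)
# Matches a leading scheme such as "http://" or "https://".
SCHEME_RE = re.compile(r"[a-z][a-z0-9+.-]*://", re.IGNORECASE)


def archived_domain(archive_url):
    """Return the host of the archived site, or None if the URL does not match."""
    match = SNAPSHOT_RE.search(archive_url)
    if match is None:
        return None
    original = match.group(1)
    # urlsplit only recognises the host when a scheme is present,
    # so add a default one for URLs like "www.feralhouse.com/...".
    if not SCHEME_RE.match(original):
        original = "http://" + original
    return urlsplit(original).hostname


print(archived_domain("http://web.archive.org/web/20010312011552/www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html"))
# www.feralhouse.com
print(archived_domain("http://web.archive.org/web/20000614170338/http://www.clonejesus.com/"))
# www.clonejesus.com
```

Because the host is parsed rather than located by searching for "www" or "com", this should also cope with sites that have no www prefix or use another top-level domain, as long as the URL follows the assumed layout.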

How do I retrieve the domain of a web-archived site from its archive URL in Python?
