Unable to process links up to a certain part

Date: 2019-07-31 18:05:47

Tags: python python-3.x

How can I parse links up to the first single slash / (the one right after the domain) and discard the rest?

List of links:

https://stackoverflow.com/questions/tagged/
https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_processing_captcha.htm
https://codereview.stackexchange.com/questions/
https://docs.python.org/3/howto/regex.html

Expected output:

https://stackoverflow.com/
https://www.tutorialspoint.com/
https://codereview.stackexchange.com/
https://docs.python.org/

I tried:

linklist = [
    "https://stackoverflow.com/questions/tagged/",
    "https://codereview.stackexchange.com/questions/",
    "https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_processing_captcha.htm",
    "https://docs.python.org/3/howto/regex.html"
]

for link in linklist:
    custom_link = link.split("/")[0]
    print(custom_link)

This gives me:

https:
https:
https:
https:
  

How can I get the desired part of each link?

1 answer:

Answer 0 (score: 1)

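There are two slashes after https:, so splitting on "/" gives "https:", an empty string, and then the host. You therefore need to join the first three elements of the split: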

linklist = [
    "https://stackoverflow.com/questions/tagged/",
    "https://codereview.stackexchange.com/questions/",
    "https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_processing_captcha.htm",
    "https://docs.python.org/3/howto/regex.html"
]

for link in linklist:
    custom_link = '/'.join(link.split("/")[:3]) + '/'
    print(custom_link)

Output:

https://stackoverflow.com/
https://codereview.stackexchange.com/
https://www.tutorialspoint.com/
https://docs.python.org/
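
To see why it is the first three elements that matter, it can help to print what split returns for a single link. This is only a small illustration; the variable names parts and base are made up for the example:

```python
link = "https://docs.python.org/3/howto/regex.html"

# Splitting on "/" turns the "//" after the scheme into an empty string.
parts = link.split("/")
print(parts)
# ['https:', '', 'docs.python.org', '3', 'howto', 'regex.html']

# Index 0 is the scheme, index 1 is the empty string, index 2 is the host.
base = "/".join(parts[:3]) + "/"
print(base)
# https://docs.python.org/
```

Note that this relies on every link starting with a scheme followed by "//". A link without a scheme would put the host at a different index.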

For more complex operations, you should take a look at urllib.parse.
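
As a rough sketch of what that could look like for this question, urlsplit from urllib.parse separates a URL into its components, and the scheme and netloc can then be put back together. The helper name base_url is hypothetical, and the sketch assumes every link is an absolute URL with a scheme:

```python
from urllib.parse import urlsplit


def base_url(link):
    """Return "scheme://host/" for an absolute URL (hypothetical helper)."""
    parts = urlsplit(link)
    # netloc is the host part (plus port or credentials, if the URL has any).
    return f"{parts.scheme}://{parts.netloc}/"


linklist = [
    "https://stackoverflow.com/questions/tagged/",
    "https://codereview.stackexchange.com/questions/",
    "https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_processing_captcha.htm",
    "https://docs.python.org/3/howto/regex.html",
]

for link in linklist:
    print(base_url(link))
```

Compared with counting slashes, this does not depend on where the path starts, and it leaves query strings and fragments out of the result.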