How can I parse links up to the first single slash (/) and discard the rest?
List of links:
https://stackoverflow.com/questions/tagged/
https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_processing_captcha.htm
https://codereview.stackexchange.com/questions/
https://docs.python.org/3/howto/regex.html
Expected output:
https://stackoverflow.com/
https://www.tutorialspoint.com/
https://codereview.stackexchange.com/
https://docs.python.org/
I tried:
linklist = [
"https://stackoverflow.com/questions/tagged/",
"https://codereview.stackexchange.com/questions/",
"https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_processing_captcha.htm",
"https://docs.python.org/3/howto/regex.html"
]
for link in linklist:
    custom_link = link.split("/")[0]
    print(custom_link)
This gives me:
https:
https:
https:
https:
How can I get the part of the link I want?
Answer 0 (score: 1)
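The scheme (https:) is followed by two slashes, so splitting on "/" yields "https:", an empty string, and the host name as the first three elements. You therefore need to join the first three elements of the split and append a trailing slash: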
linklist = [
"https://stackoverflow.com/questions/tagged/",
"https://codereview.stackexchange.com/questions/",
"https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_processing_captcha.htm",
"https://docs.python.org/3/howto/regex.html"
]
for link in linklist:
    custom_link = '/'.join(link.split("/")[:3]) + '/'
    print(custom_link)
This prints:
https://stackoverflow.com/
https://codereview.stackexchange.com/
https://www.tutorialspoint.com/
https://docs.python.org/
For more complex operations, you should look at urllib.parse.
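As a rough illustration of the urllib.parse route, here is a minimal sketch using urlsplit from the standard library. The helper name base_url is made up for this example, and it assumes every link is an absolute URL with a scheme; a link without one would come back with an empty scheme and host.

```python
from urllib.parse import urlsplit


def base_url(link):
    """Return scheme://host/ for an absolute URL (hypothetical helper)."""
    parts = urlsplit(link)
    # netloc holds everything between "//" and the next "/",
    # including a port or user info if the URL has one.
    return f"{parts.scheme}://{parts.netloc}/"


linklist = [
    "https://stackoverflow.com/questions/tagged/",
    "https://codereview.stackexchange.com/questions/",
    "https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_processing_captcha.htm",
    "https://docs.python.org/3/howto/regex.html",
]

for link in linklist:
    print(base_url(link))
```

Compared with counting slashes by hand, this leaves the parsing to the library, so query strings and fragments are dropped without extra string handling.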