I can get the links, but I don't know how to filter only the https ones.
Answer 0 (score: -1)
To parse the HTML, use an HTML parser such as Beautiful Soup. To extract the <a> elements you need, you can use the CSS selector 'a[href^="https"]' (it selects every <a> element whose href attribute value starts with "https"):
import requests
from bs4 import BeautifulSoup
url = 'https://sayamkanwar.com/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for a in soup.select('a[href^="https"]'):
    print(a['href'])
Prints:
https://sayamkanwar.com/work
https://sayamkanwar.com/about
https://www.facebook.com/sayamkanwar
https://github.com/sayamkanwar
https://codepen.io/sayamk/
https://medium.com/@sayamkanwar/
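Note that the prefix selector keeps any href value that merely starts with the string "https". If you want to keep only links whose scheme is exactly https, a stricter option is to parse each URL with the standard library's urllib.parse. A minimal sketch, assuming the same soup object as above:

from urllib.parse import urlparse

for a in soup.find_all('a', href=True):
    # Keep the link only if its URL scheme is exactly "https"
    if urlparse(a['href']).scheme == 'https':
        print(a['href'])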
EDIT: Using only built-in modules:
import urllib.request
from html.parser import HTMLParser

url = 'https://sayamkanwar.com/'

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            attrs = dict(attrs)
            if 'href' in attrs and attrs['href'].startswith('https'):
                print(attrs['href'])

with urllib.request.urlopen(url) as response:
    src = response.read().decode('utf-8')

parser = MyHTMLParser()
parser.feed(src)
Prints:
https://sayamkanwar.com/work
https://sayamkanwar.com/about
https://www.facebook.com/sayamkanwar
https://github.com/sayamkanwar
https://codepen.io/sayamk/
https://medium.com/@sayamkanwar/
Answer 1 (score: -1)
Try this; I only used the requests library.
import re
import requests
URL = 'https://sayamkanwar.com/'
response = requests.get(URL)
pattern = r'(a href=")((https):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)"'
all_url = re.findall(pattern, response.text)
for url in all_url:
    print(url[1])
Output:
https://www.facebook.com/sayamkanwar
https://github.com/sayamkanwar
https://codepen.io/sayamk/
https://medium.com/@sayamkanwar/
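As a side note, this pattern only matches anchors where href immediately follows the tag name (literally "a href="), which is likely why it finds fewer links than the parser-based answers, and hand-rolled regexes over HTML are generally fragile. If you still want a regex-only approach, a simpler, slightly more tolerant sketch (still not a substitute for a real HTML parser) could look like this:

import re
import requests

URL = 'https://sayamkanwar.com/'
html = requests.get(URL).text

# Capture anything inside href="..." that starts with https://,
# regardless of where the href attribute sits inside the tag.
for link in re.findall(r'href="(https://[^"]+)"', html):
    print(link)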