Get only https links

Time: 2019-07-16 06:07:06

Tags: python regex python-3.x web-crawler

I can get the links, but I don't know how to filter out only the https ones.

2 answers:

Answer 0: (score: -1)

To parse HTML, use an HTML parser such as Beautiful Soup. To extract the desired <a> elements, you can use the CSS selector 'a[href^="https"]' (which selects every <a> element whose href attribute value starts with "https"):

import requests
from bs4 import BeautifulSoup

url = 'https://sayamkanwar.com/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

for a in soup.select('a[href^="https"]'):
    print(a['href'])

Prints:

https://sayamkanwar.com/work
https://sayamkanwar.com/about
https://www.facebook.com/sayamkanwar
https://github.com/sayamkanwar
https://codepen.io/sayamk/
https://medium.com/@sayamkanwar/

Further reading:

CSS Selectors Reference
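
For reference, select() also understands the other standard CSS attribute selectors; a short sketch (the suffix and substring matched below are just illustrative):

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://sayamkanwar.com/').text, 'lxml')

# href ends with a given suffix
for a in soup.select('a[href$=".io/"]'):
    print(a['href'])

# href contains a given substring anywhere in the value
for a in soup.select('a[href*="github"]'):
    print(a['href'])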

EDIT: Using only built-in modules:

import urllib.request
from html.parser import HTMLParser

url = 'https://sayamkanwar.com/'

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Called for every opening tag; attrs is a list of (name, value) pairs
        if tag == 'a':
            attrs = dict(attrs)
            if 'href' in attrs and attrs['href'].startswith('https'):
                print(attrs['href'])

with urllib.request.urlopen(url) as response:
    src = response.read().decode('utf-8')

parser = MyHTMLParser()
parser.feed(src)

Prints:

https://sayamkanwar.com/work
https://sayamkanwar.com/about
https://www.facebook.com/sayamkanwar
https://github.com/sayamkanwar
https://codepen.io/sayamk/
https://medium.com/@sayamkanwar/
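
Note that both snippets only match href values that literally begin with "https"; site-relative links (e.g. "/work") are skipped even when the page itself is served over https. If you also want those, one option (a sketch, not part of the original answer; the href values below are made up) is to resolve each href against the base URL first:

from urllib.parse import urljoin, urlparse

base = 'https://sayamkanwar.com/'
# Hypothetical href values as they might appear in the page source
hrefs = ['/work', 'https://github.com/sayamkanwar', 'mailto:hi@example.com']

for href in hrefs:
    absolute = urljoin(base, href)            # resolve relative links against the page URL
    if urlparse(absolute).scheme == 'https':  # keep only https URLs
        print(absolute)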

Answer 1: (score: -1)

Try this; I used only the requests library.

import re
import requests

URL = 'https://sayamkanwar.com/'
response = requests.get(URL)
pattern = r'(a href=")((https):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)"'
all_url = re.findall(pattern, response.text)
for url in all_url:
    print(url[1])

Output:

https://www.facebook.com/sayamkanwar
https://github.com/sayamkanwar
https://codepen.io/sayamk/
https://medium.com/@sayamkanwar/
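
The same extraction also works with a much simpler pattern; a sketch (not the regex from the original answer) that captures any double-quoted href value beginning with https:

import re
import requests

URL = 'https://sayamkanwar.com/'
html = requests.get(URL).text

# Capture the value of every href attribute that starts with https
simple_pattern = r'href="(https://[^"]+)"'
for link in re.findall(simple_pattern, html):
    print(link)

As with the original pattern, keep in mind that regexes are fragile for HTML; an HTML parser (as in answer 0) is usually the safer choice.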

Visual breakdown of the regex: (image in the original answer)