Question

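Ultimately, I am trying to write a program that crawls all of a website's internal links and scrapes all the contact information on that site. Before I get there, though, I need to figure out how to crawl strictly internal links. The code I pasted below doesn't seem to work at all, and I don't know why.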

TL;DR: I want to make a program that crawls and prints only internal links.

import re, request
from bs4 import BeautifulSoup

def processUrl(url, domain, checkedUrls=[]):
    if domain not in url:
        return checkedUrls

    if not url in checkedUrls:
        try:
            if 'text/html' in requests.head(url).headers['Content-Type']:
                req=requests.get(url)
                if req.status_code==200:
                    print(url)
                    checkedUrls.append(url)
                    html=BeautifulSoup(req.text,'html.parser')
                    pages=html.find_all('a')
                    for page in pages:
                        url=page.get('href')
                        processUrl(url)
        except:
            pass

    return checkedUrls


checkedUrls=[]
domain = 'sentdex.com'
url='http://sentdex.com'
checkedUrls = processUrl(url, domain, checkedUrls)
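A few things in the snippet above probably explain the behaviour. The HTTP library is named requests, not request, so the import line does not provide the name the function uses. The recursive call processUrl(url) omits the required domain argument and raises a TypeError, which the bare except then swallows, so the crawl stops silently after the first page. Relative hrefs such as /about never contain the domain string, and the substring test would also accept a host like notsentdex.com. Below is a minimal sketch of one way around these issues, not a definitive implementation. The names LinkParser, is_internal, fetch_html and crawl are hypothetical, and the standard library's html.parser and urllib.parse are swapped in for BeautifulSoup so that only the fetch step needs a third-party package.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skips <a> tags with no href or an empty one
                self.hrefs.append(href)


def is_internal(url, domain):
    """True for http(s) URLs whose host is `domain` or a subdomain of it."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    return parts.scheme in ("http", "https") and (
        host == domain or host.endswith("." + domain)
    )


def fetch_html(url):
    """Return the page's HTML, or None unless it is a 200 text/html response."""
    import requests  # imported here so crawl() can be exercised offline

    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        return None
    if resp.status_code != 200:
        return None
    if "text/html" not in resp.headers.get("Content-Type", ""):
        return None
    return resp.text


def crawl(start_url, domain, fetch=fetch_html):
    """Visit internal pages breadth-first; print and return them in order."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        print(url)
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links against the current page and drop
            # #fragments so the same page is not queued twice.
            link = urldefrag(urljoin(url, href)).url
            if is_internal(link, domain) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

With the values from the question, this would be called as crawl('http://sentdex.com', 'sentdex.com'). An explicit queue replaces the recursion so that a large site cannot hit Python's recursion limit, and the fetch parameter makes it possible to try the link handling against canned HTML before touching the network.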

Crawl and print only internal links from a website using Python

0 answers: