Question

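I recently watched thenewboston's videos on writing a web crawler in Python. For some reason I'm getting an SSLError. I tried to fix it with line 6 of the code, but no luck. Any idea why it's throwing the error? The code is verbatim from thenewboston.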

import requests
from bs4 import BeautifulSoup

def creepy_crawly(max_pages):
    page = 1
    #requests.get('https://www.thenewboston.com/', verify = True)
    while page <= max_pages:

        url = "https://www.thenewboston.com/trade/search.php?pages=" + str(page)
        source_code = requests.get(url)
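        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")  # name the parser explicitly to avoid bs4's warning

        for link in soup.find_all('a', {'class': 'item-name'}):  # find_all is the current name for findAll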
            href = "https://www.thenewboston.com" + link.get('href')
            print(href)

        page += 1

creepy_crawly(1)

Answer 1

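I've written a web crawler using urllib. It can be faster and has no problem accessing https pages. One caveat is that it doesn't verify the server certificate (at least in Python 2 releases before 2.7.9), which makes it faster but more dangerous (vulnerable to MITM attacks). Below is a usage example of the lib: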

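import urllib  # Python 2 module; in Python 3, urlopen lives in urllib.request

link = 'https://www.stackoverflow.com'
html = urllib.urlopen(link).read()  # fetch the page and read the raw HTML
print(html)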

Getting the HTML from a page takes just a few lines. Simple, isn't it?

More information on urllib: https://docs.python.org/2/library/urllib.html
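If the missing certificate check worries you, here is a minimal sketch of the same fetch with verification left on. It assumes Python 3, where urlopen lives in urllib.request and accepts an ssl context; make_verifying_context and fetch_html are hypothetical helper names chosen for illustration, not part of any library.

```python
import ssl
import urllib.request


def make_verifying_context():
    # create_default_context() loads the system CA certificates and
    # enables both certificate and hostname verification.
    return ssl.create_default_context()


def fetch_html(url, timeout=10):
    # urlopen raises urllib.error.URLError (wrapping the SSL error)
    # when the server certificate cannot be verified.
    context = make_verifying_context()
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html('https://www.stackoverflow.com') should return the page as text, and a failed verification surfaces as an exception rather than being silently ignored. That trades a little speed for protection against the MITM risk mentioned above.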

I'd also suggest using a regular expression on the HTML to pick out the other links. An example (using the re library) would be:

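    import re
    import urlparse  # Python 2 module; in Python 3 use urllib.parse

    # origLink is the URL the HTML was fetched from (not defined in this snippet)
    for url in re.findall(r'<a[^>]+href=["\'](.[^"\']+)["\']', html, re.I):  # Searches the HTML for other URLs
        target = url.split("#", 1)[0]  # Drops the #fragment part
        # Keeps absolute URLs as they are; prefixes relative ones with the scheme and host of origLink
        link = target if url.startswith("http") else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(origLink)) + target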
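For anyone on Python 3, the same idea can be sketched roughly like this. It swaps the manual scheme-and-host formatting for urllib.parse.urljoin, which also handles relative paths without a leading slash. extract_links and HREF_PATTERN are hypothetical names, and the regex is a simplified variant of the one above, so treat it as an illustration rather than a robust HTML parser.

```python
import re
from urllib.parse import urldefrag, urljoin

# Captures the href value of anchor tags, case-insensitively.
HREF_PATTERN = re.compile(r'<a[^>]+href=["\']([^"\']+)["\']', re.I)


def extract_links(html, base_url):
    links = []
    for href in HREF_PATTERN.findall(html):
        # Resolve relative links against the page they were found on.
        absolute = urljoin(base_url, href)
        # Drop the #fragment so the same page is not listed twice.
        absolute, _fragment = urldefrag(absolute)
        # Keep only web links (skips mailto:, javascript:, etc.) and avoid duplicates.
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return links
```

You would pass in the HTML you fetched plus the URL it came from, for example extract_links(html, link), and get back a list of absolute URLs to crawl next.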

Python Web Crawler from thenewboston
