Python crawler (bs4, urlopen) not working as expected

Asked: 2018-05-20 17:30:27

Tags: python-3.x web-crawler

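I am playing around with a web page that contains MTG cards, and I am trying to extract some information about them. The following program works fine: I can crawl through the page and retrieve all the desired information: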

import re
from math import ceil
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

def NumOfNextPages(TotalCardNum, CardsPerPage):
    pages = ceil(TotalCardNum / CardsPerPage)
    return pages

URL = "xyz.com"
NumOfCrawledPages = 0

UClient = uReq(URL)  # downloading the url
page_html = UClient.read()
UClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")


# Finds all the cards that exist in the webpage and stores them as a bs4 object
cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})
CardsPerPage = len(cards)


# Selects the card names, Power and Toughness, Set that they belong
for card in cards:

    card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")

    if len(card.div.contents) > 3:
        cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
    else:
        cardP_T = "Does not exist"

    cardType = card.contents[3].text
    print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")

# Try to extract the URL of the next page. There is not always a next page,
# so findAll() may return an empty list, and indexing [0] then raises IndexError.
try:
    URL_Next = "xyz.com/" + page_soup.findAll("li", {"class": "next"})[0].contents[0].get("href")
except IndexError:
    # IndexError means there is no next page to crawl
    print("Crawling process completed! No more information to retrieve!")
else:
    print("The next URL is: " + URL_Next + "\n")
    NumOfCrawledPages += 1
finally:
    print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")

# We need the overall number of cards available in order to compute the
# number of pages that we need to crawl.
# That information is taken from a "div" tag with class "summary".

OverallCardInfo = (page_soup.find("div", {"class": "summary"})).text
TotalCardNum = int(re.findall(r"\d+", OverallCardInfo)[2])
NumOfPages = NumOfNextPages(TotalCardNum, CardsPerPage)

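With this I can crawl the first page, which I supply manually, and extract the total number of pages I need to crawl as well as the URL of the next page.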
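The page-count step can be checked on its own, without touching the network. The following is a minimal sketch, not the site's actual format: the summary string is a hypothetical example, and the helper names `total_cards` and `num_of_pages` are made up for illustration. It assumes, as the code above does, that the total is the third number in the summary text.

```python
import re
from math import ceil


def total_cards(summary_text, position=2):
    """Return the number at the given position in the summary text.

    position=2 mirrors the original code, which takes the third number.
    """
    numbers = re.findall(r"\d+", summary_text)
    return int(numbers[position])


def num_of_pages(total_card_num, cards_per_page):
    """Number of pages needed to show all cards, rounding up."""
    if cards_per_page <= 0:
        # len(cards) can be 0 if the selector matched nothing on the page
        raise ValueError("cards_per_page must be positive")
    return ceil(total_card_num / cards_per_page)


# Hypothetical summary text; the real "summary" div may be worded differently.
summary = "Showing 1 - 30 of 95 results"
total = total_cards(summary)
print(total, num_of_pages(total, 30))
```

Guarding against a zero `cards_per_page` matters here because `CardsPerPage` comes from `len(cards)`, which is 0 whenever the card selector finds nothing.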

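Ultimately I want to give a starting point (a web page) and have the crawler move on to the other pages automatically. So I used the following for loop: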

for i in range(0, NumOfPages):
    # The number of items shown by the search option on xyz.com cannot
    # be more than 10000
    if ((NumOfCrawledPages + 1) * CardsPerPage) >= 10000:
        print("Number of results provided can not exceed 10000!\nEnd of the crawling!")
        break

    if i == 0:
        Url = InitURL
    else:
        Url = URL_Next

    # opening up the connection and grabbing the page
    UClient = uReq(Url)  # downloading the url
    page_html = UClient.read()
    UClient.close()

    # html parsing
    page_soup = soup(page_html, "html.parser")

    # Finds all the cards that exist in the webpage and stores them as a bs4 object
    cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})

    # Selects the card names, Power and Toughness, Set that they belong
    for card in cards:

        card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")

        if len(card.div.contents) > 3:
            cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
        else:
            cardP_T = "Does not exist"

        cardType = card.contents[3].text
        print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")

    # Try to extract the URL of the next page. There is not always a next page,
    # so findAll() may return an empty list, and indexing [0] then raises IndexError.
    try:
        URL_Next = "xyz.com" + page_soup.findAll("li", {"class": "next"})[0].contents[0].get("href")
    except IndexError:
        # IndexError means there is no next page to crawl
        print("Crawling process completed! No more information to retrieve!")
    else:
        print("The next URL is: " + URL_Next + "\n")
        NumOfCrawledPages += 1
        Url = URL_Next
    finally:
        print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")

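The second snippet, with the extra for loop, runs without errors, but the result is not what I expect. It returns the crawl results for the first page, which I entered manually, and does not continue to the other pages...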

Why does this happen?

The expected output is something like:

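Dragonspeaker Shaman P / T: 2/2 Creature - Human Barbarian Shaman

Dragonspeaker Shaman P / T: 2/2 Creature - Human Barbarian Shaman

Giant Dragon P / T: 3/3 Creature - Bird Soldier

The next URL is: xyz.com/......

Moving to page : 2

---------------------------------------------End of crawling the first page

Dragonspeaker Shaman P / T: 2/2 Creature - Human Barbarian Shaman

Dragonspeaker Shaman P / T: 2/2 Creature - Human Barbarian Shaman

Giant Dragon P / T: 3/3 Creature - Bird Soldier

The next URL is: xyz.com/......

Moving to page : 3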

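After retrieving this information from the manually supplied page, the for loop should continue with the next page, which is stored in the Url variable. Instead, it keeps crawling the same page over and over again. The counter works fine, since it counts the number of pages crawled, but the Url variable does not seem to change its value.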
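One way to make the "same page again and again" symptom visible is to drive the crawl from the next link itself and record every URL visited. The following is only a sketch under stated assumptions, not a diagnosis of the code above. The names `crawl` and `load_page_bs4` are hypothetical, the next link is assumed to be an `<a>` inside `<li class="next">`, and the start URL is assumed to include a scheme such as `https://`, which `urlopen` requires. Relative hrefs are resolved with `urllib.parse.urljoin` instead of string concatenation.

```python
from urllib.parse import urljoin


def crawl(start_url, load_page, max_pages=1000):
    """Follow "next" links from start_url and return the URLs crawled.

    load_page(url) must process one page and return the href of the
    next page, or None when there is no next page.
    """
    visited = []
    url = start_url
    while url is not None and len(visited) < max_pages:
        if url in visited:
            # The next link led back to a page that was already crawled
            print("Next link points to an already crawled page: " + url)
            break
        href = load_page(url)
        visited.append(url)
        print("Crawled page " + str(len(visited)) + ": " + url)
        # Resolve a relative href against the current page URL
        url = urljoin(url, href) if href else None
    return visited


def load_page_bs4(url):
    """Download and parse one page, then return the next href or None."""
    # Imported here so that crawl() can be tried without bs4 installed
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    with urlopen(url) as client:
        page_soup = BeautifulSoup(client.read(), "html.parser")

    # ... extract and print the cards here, as in the loop above ...

    next_li = page_soup.find("li", {"class": "next"})
    if next_li is None or next_li.a is None:
        return None  # assumption: no <li class="next"> link on the last page
    return next_li.a.get("href")
```

Calling `crawl(start_url, load_page_bs4)` would then print each URL as it is fetched, so a repeated URL shows up immediately. The `max_pages` argument can carry the 10000-item limit, for example as `9999 // CardsPerPage`.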

0 answers:

No answers