Unable to get the generated next-page links crawled recursively

Date: 2017-05-11 12:25:45

Tags: python class web-scraping web-crawler

The scraper I created fetches names and URLs from a web page. Now I can't get it to use the next_page links it extracts to fetch data from the following pages. Building a crawler with a class is quite new to me, so I can't work out how to go any further on my own. I did try tweaking my code a little, but it neither produced any results nor threw any error. I hope someone will take a look.

import requests
from lxml import html

class wiseowl:
    def __init__(self,start_url):
        self.start_url=start_url
        self.storage=[]

    def crawl(self):
        self.get_link(self.start_url)

    def get_link(self,link):
        url="http://www.wiseowl.co.uk"
        response=requests.get(link)
        tree=html.fromstring(response.text)
        name=tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/text()")
        urls=tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/@href")
        docs=(name,urls)
        self.storage.append(docs)

        next_page=tree.xpath("//div[contains(concat(' ', @class, ' '), ' woPaging ')]//a[@class='woPagingItem']/@href")
        for npage in next_page:
            if npage is not None:
                self.get_link(url+npage)


    def __str__(self):
        return "{}".format(self.storage)


crawler=wiseowl("http://www.wiseowl.co.uk/videos/")
crawler.crawl()
for item in crawler.storage:
    print(item)

2 Answers:

Answer 0 (score: 1)

I modified some parts of your class; give it a try:

class wiseowl:
    def __init__(self,start_url):
        self.start_url=start_url
        self.links = [ self.start_url ]    #  a list of links to crawl # 
        self.storage=[]

    def crawl(self): 
        for link in self.links :    # call get_link for every link in self.links #
            self.get_link(link)

    def get_link(self,link):
        print('Crawling: ' + link)
        url="http://www.wiseowl.co.uk"
        response=requests.get(link)
        tree=html.fromstring(response.text)
        name=tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/text()")
        urls=tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/@href")
        docs=(name,urls)
        #docs=(name, [url+u for u in urls])    # use this line if you want to join the urls # 
        self.storage.append(docs)
        next_page=tree.xpath("//div[contains(concat(' ', @class, ' '), ' woPaging ')]//*[@class='woPagingItem' or @class='woPagingNext']/@href")    # get links form 'woPagingItem' or 'woPagingNext' # 
        for npage in next_page:
            if npage and url+npage not in self.links :    # don't get the same link twice # 
                self.links += [ url+npage ]

    def __str__(self):
        return "{}".format(self.storage)

crawler=wiseowl("http://www.wiseowl.co.uk/videos/")
crawler.crawl()
for item in crawler.storage:
    item = zip(item[0], item[1])
    for i in item : 
        print('{:60} {}'.format(i[0], i[1]))    # you can change 60 to the value you want # 
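
A note on the commented-out docs=(name, [url+u for u in urls]) line: plain string concatenation works here because the site's hrefs are root-relative paths. If you ever have to handle a mix of relative and absolute hrefs, the standard library's urllib.parse.urljoin is a safer way to build full URLs. A minimal sketch (the example paths are illustrative, not taken from the site):

from urllib.parse import urljoin

base = "http://www.wiseowl.co.uk/videos/"
# urljoin copes with root-relative paths, relative paths and absolute URLs alike
print(urljoin(base, "/videos/default-2.htm"))     # -> http://www.wiseowl.co.uk/videos/default-2.htm
print(urljoin(base, "default-2.htm"))             # -> http://www.wiseowl.co.uk/videos/default-2.htm
print(urljoin(base, "http://example.com/x.htm"))  # -> http://example.com/x.htm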

Answer 1 (score: 0)

You should consider using some kind of data structure to hold the links you have already visited (to avoid infinite loops), as well as a container for the links you still have to visit. Crawling is essentially a breadth-first search of the web, so reading up on breadth-first search will help you understand the underlying algorithm.

  1. Implement a queue for the links you need to visit. Every time you visit a link, scrape the page for all of its links and enqueue each one.
  2. Implement a set (or dictionary) in Python to check whether each link you are about to enqueue has already been visited; if it has, don't enqueue it again.
  3. Your crawl method should look something like the skeleton below:

    def crawler(self):
        while len(self.queue):
            curr_link = self.queue.pop(0)
            # process curr_link here -> scrape and add more links to the queue
            # mark curr_link as visited
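
To make that skeleton concrete, here is one possible sketch of the approach described above: a collections.deque holds the pages still to visit and a set records the URLs already enqueued, so no page is fetched twice. The XPath expressions and base URL are reused from the question; the class and method names are illustrative, not part of either answer.

from collections import deque

import requests
from lxml import html

class WiseOwlCrawler:
    BASE = "http://www.wiseowl.co.uk"

    def __init__(self, start_url):
        self.queue = deque([start_url])    # links still to visit (FIFO -> breadth-first)
        self.visited = {start_url}         # links already enqueued, to avoid loops
        self.storage = []

    def crawl(self):
        while self.queue:
            self.scrape(self.queue.popleft())

    def scrape(self, link):
        tree = html.fromstring(requests.get(link).text)
        # same XPath expressions as in the question
        names = tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/text()")
        urls = tree.xpath("//p[@class='woVideoListDefaultSeriesTitle']/a/@href")
        self.storage.append((names, urls))
        # enqueue every paging link that has not been seen yet
        paging = tree.xpath("//div[contains(concat(' ', @class, ' '), ' woPaging ')]"
                            "//a[@class='woPagingItem']/@href")
        for href in paging:
            full = self.BASE + href
            if full not in self.visited:
                self.visited.add(full)
                self.queue.append(full)

crawler = WiseOwlCrawler("http://www.wiseowl.co.uk/videos/")
crawler.crawl()
for names, urls in crawler.storage:
    print(names, urls)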