Question

I have a list of URLs that I need to crawl so that I can index those pages and add them to my searchindex.db:

urls = ['http://consequenceofsound.net/', 'http://www.tinymixtapes.com/', 'https://www.residentadvisor.net/']

This is how I initialize my crawler class:

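import sqlite3

class crawler:
    # Initialize the crawler with the name of the database
    def __init__(self,dbname):
        self.con = sqlite3.connect(dbname)

    # Close the database connection when the crawler object is destroyed
    def __del__(self):
        self.con.close()

    # Commit pending changes to the database
    def dbcommit(self):
        self.con.commit()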

This is the crawl method:

    def crawl(self,pages,depth=2):
        (...)
        # code here that opens the pages and adds links to the database
        (...)

        self.dbcommit()

        pages=newpages

Here I instantiate my crawler class:

crawler=crawler('searchindex.db')

pagelist = [urls[0]]

crawler.crawl(pagelist)

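How can I schedule the URL crawling and page indexing so that each crawl-and-index run starts once the previous one has finished, and resumes if it was interrupted for any reason?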
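
To make the goal concrete, here is a minimal sketch of one possible approach, not a tested solution: keep a small queue table in the same SQLite database and record a status for every URL, so a restarted process can pick up where the previous one stopped. The table name crawl_queue, the status values, and the helpers init_queue and run_queue are all hypothetical names invented for this illustration; crawl_one stands for any callable that crawls and indexes a single URL.

```python
import sqlite3

def init_queue(con, urls):
    """Create the (hypothetical) crawl_queue table and register new URLs as pending."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS crawl_queue ("
        "url TEXT PRIMARY KEY, status TEXT NOT NULL DEFAULT 'pending')"
    )
    # URLs already in the table keep their current status.
    con.executemany(
        "INSERT OR IGNORE INTO crawl_queue (url) VALUES (?)",
        [(u,) for u in urls],
    )
    # A URL still marked 'running' means an earlier process stopped mid-crawl,
    # so it is put back in the queue.
    con.execute("UPDATE crawl_queue SET status = 'pending' WHERE status = 'running'")
    con.commit()

def run_queue(con, crawl_one):
    """Crawl pending URLs one at a time; each starts only after the previous one ends."""
    while True:
        row = con.execute(
            "SELECT url FROM crawl_queue WHERE status = 'pending' "
            "ORDER BY rowid LIMIT 1"
        ).fetchone()
        if row is None:
            break  # nothing left to crawl
        url = row[0]
        con.execute("UPDATE crawl_queue SET status = 'running' WHERE url = ?", (url,))
        con.commit()
        try:
            crawl_one(url)
            status = 'done'
        except Exception:
            # Marked 'failed' so one bad URL does not block the rest of the queue.
            status = 'failed'
        con.execute("UPDATE crawl_queue SET status = ? WHERE url = ?", (status, url))
        con.commit()

# Possible usage with the crawler class from the question (assumption: crawl()
# accepts a list of pages):
#   con = sqlite3.connect('searchindex.db')
#   init_queue(con, urls)
#   c = crawler('searchindex.db')
#   run_queue(con, lambda url: c.crawl([url]))
```

With this layout, rerunning the same script after a crash or Ctrl+C skips the URLs already marked done and retries the one that was in progress. The script itself could be started periodically by an external scheduler such as cron, which is outside the scope of this sketch.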

Scheduling SQL jobs

0 answers: