Question

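I am trying to build a crawler that crawls a list of websites by following all the links on each site's home page and then repeating this on the new pages. I think I may be using the rules attribute incorrectly. The spider never calls the processor method. It reports no links and no error messages. I have omitted some functions to show the changes I made to add crawling. I am using Scrapy 1.5.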

Answer 1

Try adding the following to your code, changing the callback to parse:

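import scrapy

# Methods of the spider class (excerpt).

def start_requests(self):
    self.inf = DataInterface()
    df = self.inf.searchData()

    # Use the third row of the data as a test case.
    row = df.iloc[2]
    print(row)
    # url = 'http://' + row['Website'].lower()
    # self.rules.append()
    url = 'http://example.com/Page.php?ID=7'
    # Route the response to parse so you can confirm the page is fetched.
    req = scrapy.http.Request(url=url, callback=self.parse,
                              meta={'index': 1, 'depth': 0,
                                    'firstName': row['First Name'],
                                    'lastName': row['Last Name'],
                                    'found': {}, 'title': row['Title']})
    yield req

def parse(self, response):
    # Debugging aid: print the page body. In a CrawlSpider, overriding parse
    # bypasses the rules, so remove this method once the request works.
    print(response.text)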

Using CrawlSpider rules in Scrapy