Question

I am trying to crawl a list of URLs and retrieve the h1 of each one. The URLs are stored in a text file. The code is:

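from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = "sitemaplocation"
    allowed_domains = ["xyz.nl"]
    # Read the start URLs, one per line, from a text file
    f = open("locationlist.txt", 'r')
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        sel = Selector(response)
        # Text of every <h1 class="no-bd"> element on the page
        title = sel.xpath("//h1[@class='no-bd']/text()").extract()
        print(title)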

The code crawls the site but does not print anything. Any help would be appreciated.

Answer 1

Try moving this:

name = "sitemaplocation"
allowed_domains = ["xyz.nl"]
f = open("locationlist.txt",'r')
start_urls = [url.strip() for url in f.readlines()]
f.close()

into the __init__ method of the MySpider class.

Also, where are you calling the parse function?
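
If the file read is moved as suggested, the reading step can be isolated in a small helper. The following is only a sketch: load_start_urls is a hypothetical name, not part of Scrapy. It also skips blank lines, on the assumption that the file may contain them, since an empty string is not a usable start URL.

```python
def load_start_urls(path):
    """Return the non-empty, stripped lines of a text file as a list of URLs."""
    # The with block closes the file even if reading raises an exception.
    with open(path, "r") as f:
        # Skip blank lines so that no empty string ends up in start_urls.
        return [line.strip() for line in f if line.strip()]
```

Inside MySpider.__init__ this could then be used as self.start_urls = load_start_urls("locationlist.txt").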

Answer 2

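Try making your spider inherit from Spider instead of CrawlSpider. From the Scrapy documentation:

"When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work."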

Scrapy start_urls in a text file
