Scrapy only scrapes the first 4 start URLs

Asked: 2017-11-17 20:25:06

Tags: scrapy

Here is my spider:

class Spider(scrapy.Spider):
    name = "spider"
    start_urls = []
    with open("clause/clauses.txt") as f:
        for line in f:
            start_urls(line)
    base_url = "<url>"
    start_urls = [base_url + "-".join(url.split()) for url in start_url]

    def start_requests(self):
        self.log("start_urls - {}".format(self.start_urls))
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True, priority=2, callback=self.parse)

    def parse(self, response):
        text_items = response.css("some css").extract()

        for text in text_items:
            if text == "\n":
                continue
            yield Item({"text" : text})

        yield response.follow(response.css("a::attr(href)").extract_first(), callback=self.parse)

There are 20 start URLs, but I noticed that only the first 4 are actually requested; the rest are never executed. The desired behavior is for the spider to first call all 20 start URLs, then continue from each one to the next page.

1 Answer:

Answer 0 (score: 0)

It looks like you have a typo:

 start_urls = [base_url + "-".join(url.split()) for url in start_url]

should be:

 start_urls = [base_url + "-".join(url.split()) for url in start_urls]

Note the missing s in start_urls.

And I suspect that this:

with open("clause/clauses.txt") as f:
    for line in f:
        start_urls(line)

should be:

with open("clause/clauses.txt") as f:
    for line in f:
        start_urls.append(line)
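Putting both fixes together, the URL-list construction at class-body level could look like the sketch below. The `base_url` value and the sample lines are placeholders (the real values come from `clause/clauses.txt` and the `"<url>"` in the question); note that `line.split()` already discards the trailing `"\n"` that each line read from a file carries, so no extra `.strip()` is needed before joining:

```python
# Hypothetical base URL standing in for the "<url>" placeholder in the question.
base_url = "https://example.com/"

# Stand-in for the lines read via open("clause/clauses.txt");
# file lines would each end with "\n".
lines = ["first clause\n", "second clause\n"]

# split() drops all whitespace (including the trailing newline),
# and "-".join() rebuilds each clause as a hyphenated URL segment.
start_urls = [base_url + "-".join(line.split()) for line in lines]
# start_urls -> ["https://example.com/first-clause",
#                "https://example.com/second-clause"]
```

With `start_urls` built this way, `start_requests` would iterate over all 20 URLs instead of a partially built (or misspelled) list.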