This is my spider:
class Spider(scrapy.Spider):
    name = "spider"
    start_urls = []
    with open("clause/clauses.txt") as f:
        for line in f:
            start_urls(line)
    base_url = "<url>"
    start_urls = [base_url + "-".join(url.split()) for url in start_url]

    def start_requests(self):
        self.log("start_urls - {}".format(self.start_urls))
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True, priority=2, callback=self.parse)

    def parse(self, response):
        text_items = response.css("some css").extract()
        for text in text_items:
            if text == "\n":
                continue
            yield Item({"text": text})
        yield response.follow(response.css("a::attr(href)").extract_first(), callback=self.parse)
There are 20 start URLs, but I've noticed that only the first 4 are actually requested; the rest are never executed. The desired behavior is to first request all 20 start URLs, and then follow from each of them to the next page.
Answer 0 (score: 0):
It looks like you have a typo:
start_urls = [base_url + "-".join(url.split()) for url in start_url]
should be:
start_urls = [base_url + "-".join(url.split()) for url in start_urls]
Note the missing s in start_url.
And I suspect that this:
with open("clause/clauses.txt") as f:
    for line in f:
        start_urls(line)
should be:
with open("clause/clauses.txt") as f:
    for line in f:
        start_urls.append(line)
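For completeness, here is a minimal sketch of the spider with both fixes applied. The Item class, the clause/clauses.txt path, the "<url>" base URL, and the CSS selectors are placeholders carried over from the question; the .strip() call and the None check before response.follow are extra assumptions added so the sketch runs cleanly, not part of the original code:

import scrapy

class Spider(scrapy.Spider):
    name = "spider"

    # Read one search phrase per line; this runs at class-definition time.
    start_urls = []
    with open("clause/clauses.txt") as f:
        for line in f:
            # Fix 1: append to the list instead of calling it.
            # .strip() (an assumption) drops the trailing newline of each line.
            start_urls.append(line.strip())

    base_url = "<url>"  # placeholder from the question
    # Fix 2: iterate over start_urls (note the s), not start_url.
    start_urls = [base_url + "-".join(url.split()) for url in start_urls]

    def start_requests(self):
        self.log("start_urls - {}".format(self.start_urls))
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True, priority=2,
                                 callback=self.parse)

    def parse(self, response):
        for text in response.css("some css").extract():  # placeholder selector
            if text == "\n":
                continue
            yield Item({"text": text})  # Item is assumed to be defined elsewhere
        # Assumption: guard against pages with no next link, since
        # response.follow(None) would raise a TypeError.
        next_page = response.css("a::attr(href)").extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)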