So far so good for me. My question is how to continue crawling this list of URLs. After searching, I know I can return a Request from parse, but it seems that only handles a single URL.
Here is my parse:
def parse(self, response):
    # Get the list of URLs, for example:
    url_list = ["http://a.com", "http://b.com", "http://c.com"]
    return scrapy.Request(url_list[0])
    # It works, but how can I continue with b.com and c.com?
Can I do something like this?
def parse(self, response):
    # Get the list of URLs, for example:
    url_list = ["http://a.com", "http://b.com", "http://c.com"]
    for link in url_list:
        scrapy.Request(link)
    # This is wrong, though I need something like this
Full version:
import scrapy

class MySpider(scrapy.Spider):
    name = "mySpider"
    allowed_domains = ["x.com"]
    start_urls = ["http://x.com"]

    def parse(self, response):
        # Get the list of URLs, for example:
        url_list = ["http://a.com", "http://b.com", "http://c.com"]
        for link in url_list:
            scrapy.Request(link)
        # This is wrong, though I need something like this
Answer 0 (score: 5)
I think what you are looking for is the yield statement:
def parse(self, response):
    # Get the list of URLs, for example:
    url_list = ["http://a.com", "http://b.com", "http://c.com"]
    for link in url_list:
        request = scrapy.Request(link)
        yield request
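The reason this works is that using yield turns parse into a generator: Scrapy's engine iterates over everything the method yields and schedules each request, instead of stopping at the first return. A minimal stdlib-only sketch of that pattern (no Scrapy required; the fake_schedule helper is hypothetical, used only to stand in for Scrapy's engine):

```python
def parse(urls):
    # Yield one "request" per URL instead of returning a single one.
    for link in urls:
        yield {"url": link}  # stands in for scrapy.Request(link)

def fake_schedule(generator):
    # Scrapy's engine does something similar: it iterates the generator
    # and schedules every request it yields.
    return [req["url"] for req in generator]

urls = ["http://a.com", "http://b.com", "http://c.com"]
scheduled = fake_schedule(parse(urls))
print(scheduled)  # all three URLs are scheduled, not just the first
```

With a plain return, only the first element would ever be processed; with yield, the loop hands every URL to the caller one at a time.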
Answer 1 (score: 1)
For this, you need to create a subclass of scrapy.Spider and define the list of URLs to start with. Scrapy will then automatically follow the links it finds.
Do something like this:
import scrapy

class YourSpider(scrapy.Spider):
    name = "your_spider"
    allowed_domains = ["a.com", "b.com", "c.com"]
    start_urls = [
        "http://a.com/",
        "http://b.com/",
        "http://c.com/",
    ]

    def parse(self, response):
        # do whatever you want
        pass
You can find more information in the Scrapy official documentation.
Answer 2 (score: 0)
# within your parse method:
url_list = response.xpath('//a/@href').extract()
print(url_list)  # to see the list of URLs
for url in url_list:
    yield scrapy.Request(url, callback=self.parse)
This should work.
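To make it concrete what response.xpath('//a/@href').extract() gives you, here is a stdlib-only sketch of the same idea: collecting the href attribute of every anchor tag in a page. This uses Python's html.parser as a rough stand-in for Scrapy's selectors (the LinkExtractor class name and sample HTML are made up for illustration):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # Collects the href attribute of every <a> tag, roughly what
    # response.xpath('//a/@href').extract() returns in Scrapy.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

html = '<a href="http://a.com">A</a><a href="http://b.com">B</a>'
extractor = LinkExtractor()
extractor.feed(html)
print(extractor.links)  # ['http://a.com', 'http://b.com']
```

In a real spider you would yield a scrapy.Request for each of those links, exactly as the answer above shows.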