I've written a small scraper in Python Scrapy to parse names from a webpage. The results are spread over 4 pages via pagination. The total number of names across the pages is 46, but the scraper is only collecting 36.

The scraper is supposed to skip the content of the first landing page, and I thought I had handled that with the parse_start_url argument in my spider.

However, the problem I'm facing now is that it unexpectedly skips the content of the second page and parses everything else, I mean the first page, the third page, the fourth page and so on. Why is this happening and how can I deal with it? Thanks in advance.

Here is the script I'm trying:
import scrapy

class DataokSpider(scrapy.Spider):
    name = "dataoksp"
    start_urls = ["https://data.ok.gov/browse?page=1&f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"]

    def parse(self, response):
        for link in response.css('.pagination .pager-item a'):
            new_link = link.css("::attr(href)").extract_first()
            yield scrapy.Request(url=response.urljoin(new_link), callback=self.target_page)

    def target_page(self, response):
        parse_start_url = self.target_page  # I used this argument to capture the content of first page
        for titles in response.css('.title a'):
            name = titles.css("::text").extract_first()
            yield {'Name': name}
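(Side note: parse_start_url is not an argument you assign inside a callback; it is a method of Scrapy's CrawlSpider that you can override to control what is done with the start_urls responses. A minimal sketch of that approach, with the class and callback names made up for illustration:)

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DataokCrawlSpider(CrawlSpider):
    # class and callback names below are illustrative, not part of the original spider
    name = "dataoksp_crawl"
    start_urls = ["https://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"]

    # follow each pagination link and hand the listing page to parse_item
    rules = (
        Rule(LinkExtractor(restrict_css='.pagination .pager-item a'), callback='parse_item'),
    )

    def parse_start_url(self, response):
        # called for the start_urls responses; returning nothing skips the landing page
        return []

    def parse_item(self, response):
        for title in response.css('.title a'):
            yield {'Name': title.css('::text').extract_first()}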
Answer 0 (score: 1)

The solution turned out to be very simple. I've fixed it myself:
import scrapy

class DataokSpider(scrapy.Spider):
    name = "dataoksp"
    start_urls = ["https://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"]

    def parse(self, response):
        for f_link in self.start_urls:
            yield response.follow(url=f_link, callback=self.target_page)  # this is the line which fixes the issue

        for link in response.css('.pagination .pager-item a'):
            new_link = link.css("::attr(href)").extract_first()
            yield response.follow(url=new_link, callback=self.target_page)

    def target_page(self, response):
        for titles in response.css('.title a'):
            name = titles.css("::text").extract_first()
            yield {'Name': name}
Now it gives me all the results.
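If it helps, here is a minimal way to run the spider from a plain Python script and dump the scraped names to a file (a sketch; the output file name is just an example, and the spider class is assumed to be defined in the same module):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    "FEED_FORMAT": "json",     # write the yielded {'Name': ...} items as JSON
    "FEED_URI": "names.json",  # illustrative output path
})
process.crawl(DataokSpider)    # the spider class shown above
process.start()                # blocks until the crawl finishes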
Answer 1 (score: 0)

That's because the link you specified in start_urls is actually the link to the second page. If you open it, you'll see there is no <a> tag for the current page in the pager. That's why page 2 never reaches target_page. So you should point start_urls to:

https://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191

This code should help you:
import scrapy
from scrapy.http import Request

class DataokspiderSpider(scrapy.Spider):
    name = 'dataoksp'
    allowed_domains = ['data.ok.gov']
    start_urls = ["https://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"]

    def parse(self, response):
        for titles in response.css('.title a'):
            name = titles.css("::text").extract_first()
            yield {'Name': name}

        next_page = response.xpath('//li[@class="pager-next"]/a/@href').extract_first()
        if next_page:
            yield Request("https://data.ok.gov{}".format(next_page), callback=self.parse)
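A small design note: instead of hard-coding the domain when building the next-page URL, you could let Scrapy resolve the relative href for you (a sketch; response.follow is available from Scrapy 1.4 onwards):

        next_page = response.xpath('//li[@class="pager-next"]/a/@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)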
Stats (see item_scraped_count):
{
'downloader/request_bytes': 2094,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
'downloader/response_bytes': 45666,
'downloader/response_count': 6,
'downloader/response_status_count/200': 6,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 9, 19, 7, 23, 47, 801934),
'item_scraped_count': 46,
'log_count/DEBUG': 53,
'log_count/INFO': 7,
'memusage/max': 47509504,
'memusage/startup': 47509504,
'request_depth_max': 4,
'response_received_count': 6,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2017, 9, 19, 7, 23, 46, 59360)
}