Sessions with Scrapy

Date: 2018-02-02 16:56:06

Tags: python web-scraping beautifulsoup scrapy python-requests

I am trying to scrape data from this website. I need to click each company name and then extract the data displayed on the right. I could not get it to work with plain requests and had to use a session to manage the cookies. With requests and BeautifulSoup I would do it like this:

import requests
from bs4 import BeautifulSoup
import re

start_url = r"http://directoriosancionados.funcionpublica.gob.mx/SanFicTec/jsp/Ficha_Tecnica/SancionadosN.jsp?cmdsan=ALL&tipoqry=ALL&mostrar_msg=SI"

s = requests.Session()
response = s.post(start_url)
soup = BeautifulSoup(response.text, "html.parser")

# the onclick handlers contain the "expe" identifier in the form 123/2018
pattern = re.compile(r"\d+/\d+")
links = soup.find_all("a", {"onclick": pattern})
onclicks = [link["onclick"] for link in links]
for element in onclicks[:10]:
    expe = re.search(string=element, pattern=r"\d+/\d+").group(0)
    r = s.post(url="http://directoriosancionados.funcionpublica.gob.mx/SanFicTec/jsp/Ficha_Tecnica/FichaSinTabla.jsp",
      data={"expe":expe,
              "tipo":"1",
              "persona":"3"}).text
    soup = BeautifulSoup(r, "html.parser")
    something = soup.find("p", {"class":"normal"})
    print(something)
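The regex step is the fragile part, so it is worth checking in isolation. A minimal sketch (the onclick value below is a made-up example that only follows the `\d+/\d+` shape the code above expects; the actual handler name on the site may differ):

```python
import re

# hypothetical onclick attribute, shaped like the ones the code above parses
onclick = "muestraFicha('123/2018','1','3');"

# same pattern as in the scraping loop: extract the expe identifier
expe = re.search(r"\d+/\d+", onclick).group(0)
print(expe)  # -> 123/2018
```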

Now I would like to know whether there is something similar in scrapy:

class Spider:
    def get_expe():
        #get the list of expe

    def make_requests():
        #use the same session and make post requests for each expe

    def parse():
        #extract the data

I am of course not asking you to write the spider for me. Any help on how to reuse the same cookies across a session would be appreciated.

1 Answer:

Answer 0 (score: 2)

I don't think you need to handle any special cookies beyond what Scrapy does by default: cookies are kept and resent across requests by the built-in cookies middleware. See a minimal working example for your scenario:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://directoriosancionados.funcionpublica.gob.mx/SanFicTec/jsp/Ficha_Tecnica/SancionadosN.jsp?cmdsan=ALL&tipoqry=ALL&mostrar_msg=SI']

    def parse(self, response):
        for link in response.xpath('//table//tr//a'):
            data = {
                'expe': link.xpath('./@onclick').re_first(r'\d+/\d+'),
                'tipo': '1',
                'persona': '3'
            }
            yield scrapy.FormRequest('http://directoriosancionados.funcionpublica.gob.mx/SanFicTec/jsp/Ficha_Tecnica/FichaSinTabla.jsp',
                                     formdata=data, callback=self.parse_detail)

    def parse_detail(self, response):
        yield {
            'infractor': response.xpath('(//p[@class="normal"])[1]/text()').extract_first()
        }
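For reference, `scrapy.FormRequest` with `formdata` simply issues a POST with a URL-encoded body, which is the same thing the `requests` version does with `data=`. A standard-library sketch of the payload it would send (the `expe` value is an illustrative example; `tipo` and `persona` are taken from the code above):

```python
from urllib.parse import urlencode

# form fields posted to FichaSinTabla.jsp in both versions of the code
data = {"expe": "123/2018", "tipo": "1", "persona": "3"}

body = urlencode(data)
print(body)  # -> expe=123%2F2018&tipo=1&persona=3
```

Scrapy sends the cookies it received along with each such request automatically; if you ever need several independent sessions in one spider, the documented `meta={'cookiejar': ...}` request key lets you keep separate cookie jars.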