I am trying to scrape data from this website. I need to click each company name and then extract the data shown on the right-hand side. I could not get it to work with plain requests and had to use a session to manage the cookies. With requests and BeautifulSoup I would do it like this:
import requests
from bs4 import BeautifulSoup
import re

start_url = r"http://directoriosancionados.funcionpublica.gob.mx/SanFicTec/jsp/Ficha_Tecnica/SancionadosN.jsp?cmdsan=ALL&tipoqry=ALL&mostrar_msg=SI"

# The onclick attributes contain the "expe" identifier (digits/digits)
pattern = re.compile(r"\d+/\d+")

s = requests.Session()
response = s.post(start_url)
soup = BeautifulSoup(response.text, "html.parser")
links = soup.find_all("a", {"onclick": pattern})
onclicks = [link["onclick"] for link in links]
for element in onclicks[:10]:
    expe = re.search(string=element, pattern=r"\d+/\d+").group(0)
    r = s.post(url="http://directoriosancionados.funcionpublica.gob.mx/SanFicTec/jsp/Ficha_Tecnica/FichaSinTabla.jsp",
               data={"expe": expe,
                     "tipo": "1",
                     "persona": "3"}).text
    soup = BeautifulSoup(r, "html.parser")
    something = soup.find("p", {"class": "normal"})
    print(something)
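To illustrate just the extraction step in isolation, here is how the `re.search` call behaves on a made-up onclick value shaped like the ones on the page (the function name and value are stand-ins, not taken from the actual site):

```python
import re

# Made-up onclick value; only its "digits/digits" shape matters here
onclick = "muestraFicha('123/2016');"

# Same pattern as in the scraper above
expe = re.search(r"\d+/\d+", onclick).group(0)
print(expe)  # → 123/2016
```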
Now I am wondering whether there is something similar in Scrapy:
class Spider:
    def get_expe():
        # get the list of expe

    def make_requests():
        # use the same session and make POST requests for each expe

    def parse():
        # extract the data
I obviously don't expect you to write the spider for me. Any help on how to reuse the same cookies within a session would be appreciated.
Answer 0 (score: 2)
I don't think you need to handle any special cookies beyond what Scrapy already does by default. Here is a minimal working example for your scenario:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://directoriosancionados.funcionpublica.gob.mx/SanFicTec/jsp/Ficha_Tecnica/SancionadosN.jsp?cmdsan=ALL&tipoqry=ALL&mostrar_msg=SI']

    def parse(self, response):
        for link in response.xpath('//table//tr//a'):
            data = {
                'expe': link.xpath('./@onclick').re_first(r'\d+/\d+'),
                'tipo': '1',
                'persona': '3',
            }
            yield scrapy.FormRequest(
                'http://directoriosancionados.funcionpublica.gob.mx/SanFicTec/jsp/Ficha_Tecnica/FichaSinTabla.jsp',
                formdata=data, callback=self.parse_detail)

    def parse_detail(self, response):
        yield {
            'infractor': response.xpath('(//p[@class="normal"])[1]/text()').extract_first()
        }