I have been writing data-scraping scripts in PHP for the past 3 years.
Here is a simple PHP script:
$url = 'https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY';
$fields = array(
    'p_entity_name' => urlencode('AAA'),
    'p_name_type'   => urlencode('A'),
    'p_search_type' => urlencode('BEGINS')
);

// url-ify the data for the POST
$fields_string = '';
foreach ($fields as $key => $value) {
    $fields_string .= $key . '=' . $value . '&';
}
$fields_string = rtrim($fields_string, '&');

// open connection
$ch = curl_init();

// set the url, mark the request as a POST, and attach the POST data
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // capture the response body in $result
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_string);

// execute the POST
$result = curl_exec($ch);
print curl_error($ch) . '<br>';
print curl_getinfo($ch, CURLINFO_HTTP_CODE) . '<br>';
print $result;
curl_close($ch);
It only works when CURLOPT_SSL_VERIFYPEER is set to false. If we enable CURLOPT_SSL_VERIFYPEER, or use http instead of https, an empty response comes back.
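To confirm that the failure really is certificate verification rather than the POST itself, a quick cross-check outside cURL behaves the same way. This is only a sketch and assumes the third-party requests library is installed:

# Sketch only (assumes the `requests` package); mirrors the cURL behaviour above.
import requests

url = 'https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY'
data = {'p_entity_name': 'AAA', 'p_name_type': 'A', 'p_search_type': 'BEGINS'}

try:
    # verification on (the default) -- fails the same way CURLOPT_SSL_VERIFYPEER = 1 does
    print(requests.post(url, data=data).status_code)
except requests.exceptions.SSLError as err:
    print('certificate verification failed:', err)
    # verification off -- the counterpart of CURLOPT_SSL_VERIFYPEER = 0
    print(requests.post(url, data=data, verify=False).status_code)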
However, I have to do the same project in Python Scrapy; here is the same code in Scrapy:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http.request import Request
import urllib

from appext20.items import Appext20Item

class Appext20Spider(CrawlSpider):
    name = "appext20"
    allowed_domains = ["appext20.dos.ny.gov"]
    DOWNLOAD_HANDLERS = {
        'https': 'my.custom.downloader.handler.https.HttpsDownloaderIgnoreCNError',
    }

    def start_requests(self):
        payload = {"p_entity_name": 'AMEB', "p_name_type": 'A', 'p_search_type': 'BEGINS'}
        url = 'https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY'
        yield Request(url, self.parse_data, method="POST", body=urllib.urlencode(payload))

    def parse_data(self, response):
        print('here is response')
        print(response)
It returns an empty response; SSL verification needs to be disabled for it.
Please forgive my lack of knowledge of Python Scrapy; I have searched a lot about this but found no solution.
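One detail worth flagging in the spider above: Scrapy never reads a bare DOWNLOAD_HANDLERS class attribute. Per-spider setting overrides go through custom_settings (available since Scrapy 1.0); a minimal sketch, keeping the placeholder handler path from the snippet:

# Per-spider overrides must live in custom_settings to take effect;
# the handler path below is the placeholder from the original snippet.
class Appext20Spider(CrawlSpider):
    name = "appext20"
    allowed_domains = ["appext20.dos.ny.gov"]
    custom_settings = {
        'DOWNLOAD_HANDLERS': {
            'https': 'my.custom.downloader.handler.https.HttpsDownloaderIgnoreCNError',
        },
    }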
Answer 0 (score: 1)
I suggest you have a look at this page: http://doc.scrapy.org/en/1.0/topics/settings.html You can change how modules behave and adjust the settings for the various handlers there.
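For what it's worth, Scrapy's stock context factory, scrapy.core.downloader.contextfactory.ScrapyClientContextFactory, performs no certificate verification, so HTTPS requests to hosts with broken certificate chains usually work out of the box. If you ever need to swap the factory, the relevant setting looks like this (a sketch against the Scrapy 1.0 module layout):

# settings.py -- sketch only; this path is also the Scrapy 1.0 default.
# The stock factory skips certificate checks, matching CURLOPT_SSL_VERIFYPEER = 0.
DOWNLOADER_CLIENTCONTEXTFACTORY = 'scrapy.core.downloader.contextfactory.ScrapyClientContextFactory'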
I also believe this is a duplicate question: Disable SSL certificate verification in Scrapy
HTHS
Thanks,
// P
Answer 1 (score: 0)
This code worked for me:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import FormRequest
import urllib

from appext20.items import Appext20Item
from scrapy.selector import HtmlXPathSelector

class Appext20Spider(CrawlSpider):
    name = "appext20"
    allowed_domains = ["appext20.dos.ny.gov"]
    payload = {"p_entity_name": 'AME', "p_name_type": 'A', 'p_search_type': 'BEGINS'}

    def start_requests(self):
        url = 'https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY'
        # FormRequest form-encodes the payload and sets the matching Content-Type header
        return [FormRequest(url,
                            formdata=self.payload,
                            callback=self.parse_data)]

    def parse_data(self, response):
        print('here is response')
        # each search result sits in a <td headers="c1"> cell
        questions = HtmlXPathSelector(response).xpath("//td[@headers='c1']")
        all_links = []
        for tr in questions:
            temp_dict = {}
            temp_dict['link'] = tr.xpath('a/@href').extract()
            temp_dict['title'] = tr.xpath('a/text()').extract()
            all_links.append(temp_dict)
        print(all_links)
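A plausible explanation for why this version works where the original Request did not: FormRequest URL-encodes the formdata and sets the Content-Type: application/x-www-form-urlencoded header, which the hand-built POST body in the question never declared, so the server could not parse the form. For comparison, a sketch of the same request done by hand (Python 2 urllib, as in the question):

# Drop-in replacement for start_requests above, sketched for comparison only --
# this is roughly what FormRequest does for you automatically.
# (Needs: from scrapy.http import Request, and import urllib.)
def start_requests(self):
    url = 'https://appext20.dos.ny.gov/corp_public/CORPSEARCH.SELECT_ENTITY'
    return [Request(url,
                    method='POST',
                    body=urllib.urlencode(self.payload),
                    headers={'Content-Type': 'application/x-www-form-urlencoded'},
                    callback=self.parse_data)]

Either way, the spider runs as usual with: scrapy crawl appext20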