Scrapy login works for some websites but not for others

Time: 2019-09-13 20:01:58

Tags: python scrapy

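I can log in to GitHub with the code below. However, when I try the same code on another website, it stays on the login page and never logs in. Am I missing something?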

  • GitHub
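import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser


class GithubSpider(scrapy.Spider):

    name = 'test'
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # Use the first "value" attribute found on a form input as the token
        token = response.css('form input::attr(value)').extract_first()
        return FormRequest.from_response(
            response,
            formdata={
                'csrf_token': token,
                'login': '*******',
                'password': '*******',
            },
            callback=self.start_scraping,
        )

    def start_scraping(self, response):
        # Open the post-login response in a browser to check the result
        open_in_browser(response)
        print('yes')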
  • Airline website
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser


class AirlineSpider(scrapy.Spider):

    name = 'test'
    allowed_domains = ['hawaiianairlines.com']
    start_urls = ['https://www.hawaiianairlines.com/my-account/login/']

    def parse(self, response):
        # Use the first "value" attribute found on a form input as the token
        token = response.css('form input::attr(value)').extract_first()
        return FormRequest.from_response(
            response,
            formdata={
                'csrf_token': token,
                'UserName': '*********',
                'Password': '*********',
            },
            callback=self.start_scraping,
        )

    def start_scraping(self, response):
        # Open the post-login response in a browser to check the result
        open_in_browser(response)
        print('yes')

1 answer:

Answer 0 (score: 1)

Looking at the source of the site you provided, you can find the form fields in the markup of the login form.

The form's markup in particular tells you that the browser never submits the form to the "href" target through the regular mechanism, and (unless you have tested the form with JavaScript disabled and it worked) the site can probably only be used in a JavaScript-enabled browser. FormRequest.from_response only reproduces a regular form submission, so it cannot log in to a site like that.
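One way to check how a login form is submitted is to list each form's `action`, `method`, and `onsubmit` attributes together with its input names. The sketch below does this with only the standard library. `FormInspector`, `looks_script_driven`, and `SAMPLE_HTML` are illustrative names, and the sample markup is invented for the demonstration; it is not the airline site's actual form.

```python
from html.parser import HTMLParser

# Invented markup for illustration only, not copied from any real site.
SAMPLE_HTML = """
<form action="javascript:void(0)" method="post">
  <input type="text" name="UserName">
  <input type="password" name="Password">
</form>
"""


class FormInspector(HTMLParser):
    """Collect each form's attributes and the names of its input fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._in_form = True
            self.forms.append({
                "action": attrs.get("action"),
                "method": (attrs.get("method") or "get").lower(),
                "onsubmit": attrs.get("onsubmit"),
                "inputs": [],
            })
        elif tag == "input" and self._in_form and attrs.get("name"):
            self.forms[-1]["inputs"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


def looks_script_driven(form):
    """Heuristic: a javascript: action or an onsubmit handler suggests XHR."""
    action = (form["action"] or "").strip().lower()
    return action.startswith("javascript:") or bool(form["onsubmit"])


inspector = FormInspector()
inspector.feed(SAMPLE_HTML)
for form in inspector.forms:
    print(form["inputs"], "script-driven:", looks_script_driven(form))
```

Inside a spider, the same check could be run on `response.text`. The heuristic is deliberately rough: a form with an ordinary `action` can still be intercepted by a script attached elsewhere in the page.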

Then, to find out how the form is actually submitted via XHR, you need to look through the site's JavaScript code, break down the function that handles the submission, and emulate it in your own code.

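Another option is to use a JavaScript engine together with Scrapy so that it handles the scripts for you, but the drawbacks are that it is resource-intensive and can be difficult to set up.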
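As one possible illustration of the JavaScript-engine route, the fragment below shows the settings commonly used to route requests through a headless browser with the scrapy-playwright package. The answer does not name a specific engine, so this choice is an assumption, and `PLAYWRIGHT_META` is a hypothetical helper name.

```python
# settings.py fragment (assumes the scrapy-playwright package is installed)
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Hypothetical helper: only requests carrying this meta key use the browser.
PLAYWRIGHT_META = {"playwright": True}
```

A spider would then opt in per request, for example `scrapy.Request(url, meta=PLAYWRIGHT_META)`, so that only the pages that need script execution pay the cost of a browser.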