Question

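I am trying to scrape data from https://www.flatstats.co.uk/racing-system-builder.php using Scrapy.

I want to automate the AJAX call with Scrapy. When I click the "Full SP" button (inspected in Firebug), the POST parameters contain an SQL string that looks "strange": race|2|eq|Ordinary|0|~tRIDER_TYPE. What dialect is this?

My code: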

import scrapy
import urllib.parse

class FlatStat(scrapy.Spider):

    name = "flatstat"
    allowed_domains = ["flatstats.co.uk"]
    start_urls = ["https://www.flatstats.co.uk/racing-system-builder.php"]

    def parse(self, response):

        # Text of the last cell of every row in the "system" table.
        query_lst = response.xpath('//table[@id="system"]//tr/td[last()]/text()').extract()
        query_str = ' '.join(query_lst)

        url = 'https://www.flatstats.co.uk/ajax/sb_report.php'

        body_dict = {'a_e_max': '9.99',
                     'a_e_min': '0',
                     'arch_min': '0',
                     'exp_min': '0',
                     'report_type':'S',
                     # copied from the Post parameters by inspecting. Actually I tried everything.
                     'sqlFullString' : '''Type%20(Rider)%7C%3D%7COrdinary%20(Exclude%20Amatr%2C%20App%2C%20Lady%20Races
                                         )%7CAND%7Crace%7C2%7C0%7COrdinary%7C0%7C~tRIDER_TYPE%7C-t%7Ceq''',
                     # I tried copying this from the post parameters as well but no success.
                     # I also tried sql from the table //td text() which is "normal" sql but no success
                     'sqlString': query_str}

        # Here I tried everything, FormRequest as well, though there is no form.
        return scrapy.Request(url, method="POST", body=urllib.parse.urlencode(body_dict), callback=self.parse_page)


    def parse_page(self, response):

        # response.body is bytes, so the file is opened in binary mode.
        with open("response.html", "wb") as f:
            f.write(response.body)

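So the questions are:

1. What SQL is this?
2. Why does it not return the page I need? How do I run the right query?

I also tried Selenium to click the button and let it do the work itself, but that is another unsuccessful story. :(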

Answer 1

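It is not easy to tell what the website's creators do with the submitted sqlString. It probably means something very specific to the way their backend processes the data.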

Here is an excerpt of the JavaScript embedded in the page's HTML:

...
    function system_report(type) {

        sqlString = '', sqlFullString = '', rowcount = 0;

        $('#system tr').each(function() {
            if(rowcount > 0) {
                var editdata = this.cells[6].innerHTML.split("|");
                sqlString += editdata[0] + '|' + editdata[1] + '|' + editdata[7] + '|' + editdata[3] + '|' + editdata[4] + '|' + editdata[5] + '^';
                sqlFullString += this.cells[0].innerHTML + '|' + encodeURIComponent(this.cells[1].innerHTML) + '|' + this.cells[2].innerHTML + '|' + this.cells[3].innerHTML + '|' + this.cells[6].innerHTML + '^';
            }
            rowcount++;         
        });
        sqlString = sqlString.slice(0, -1)
...

Reverse-engineering this looks non-trivial.
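That said, the visible part of the loop above can be mirrored in Python as a starting point. This is only a sketch under assumptions: `build_sql_strings` and `encode_body` are hypothetical helper names, each row is assumed to be a list of the seven cell values (header row already dropped), and the cell values are assumed to match what `innerHTML` returns in the browser. The excerpt is cut off before it shows whether `sqlFullString` also loses its trailing `^`; the value captured in the question has none, so the sketch joins without one.

```python
from urllib.parse import quote, urlencode


def build_sql_strings(rows):
    """Mirror the visible part of system_report() for a list of table rows.

    Each row is a list of 7 cell strings; rows[i][6] holds the
    pipe-separated edit data (assumed layout, taken from the excerpt).
    """
    sql_parts = []
    full_parts = []
    for cells in rows:
        edit = cells[6].split("|")
        # Same field order as the JavaScript: 0, 1, 7, 3, 4, 5.
        sql_parts.append("|".join(
            [edit[0], edit[1], edit[7], edit[3], edit[4], edit[5]]))
        # quote() with these extra safe characters approximates
        # JavaScript's encodeURIComponent for the operator cell.
        operator = quote(cells[1], safe="!*'()")
        full_parts.append("|".join(
            [cells[0], operator, cells[2], cells[3], cells[6]]))
    # Rows are separated by '^' with no trailing separator.
    return "^".join(sql_parts), "^".join(full_parts)


def encode_body(fields):
    """Form-encode raw (not yet percent-encoded) field values exactly once."""
    return urlencode(fields)
```

For the single row shown in the question, this yields the same `race|2|eq|Ordinary|0|~tRIDER_TYPE` value that appears in the POST parameters. Note that `urlencode` escapes every `%` it is given, so a value pasted in already percent-encoded form is encoded a second time; passing the raw strings avoids that. Whether the server then accepts the request is not something the excerpt can confirm.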

Although it is not a solution to your "sql" question above, I suggest you try Splash (an alternative to Selenium in some cases).

You can launch it with Docker (the easiest way):

$ sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash

With the following script:

function main(splash)
  local url = splash.args.url
  assert(splash:go(url))
  assert(splash:wait(0.5))

  -- this clicks the "Full SP" button
  assert(splash:runjs("$('#b-full-report').click()"))
  -- loading the report takes some time
  assert(splash:wait(5))
  return {
    html = splash:html()
  }
end

you get the page HTML, including the report popup.
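One way to try the script outside Scrapy is to post it to Splash's `execute` HTTP endpoint. The sketch below uses only the standard library and assumes Splash is listening on `localhost:8050`, as in the Docker command above; `build_execute_request` and `fetch_report_html` are hypothetical helper names, and the 30-second `timeout` argument is an arbitrary choice.

```python
import json
import urllib.request

# The Lua script from above, passed to Splash as the lua_source argument.
LUA_SCRIPT = """
function main(splash)
  local url = splash.args.url
  assert(splash:go(url))
  assert(splash:wait(0.5))

  -- this clicks the "Full SP" button
  assert(splash:runjs("$('#b-full-report').click()"))
  -- loading the report takes some time
  assert(splash:wait(5))
  return {
    html = splash:html()
  }
end
"""


def build_execute_request(page_url, splash_base="http://localhost:8050"):
    """Build the POST request for Splash's /execute endpoint."""
    payload = {
        "lua_source": LUA_SCRIPT,
        "url": page_url,   # available in Lua as splash.args.url
        "timeout": 30,     # seconds Splash may spend on the script
    }
    return urllib.request.Request(
        splash_base + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_report_html(page_url):
    """Run the script in Splash and return the rendered HTML string."""
    request = build_execute_request(page_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        # The Lua table returned by main() arrives as a JSON object.
        return json.load(response)["html"]
```

Calling `fetch_report_html("https://www.flatstats.co.uk/racing-system-builder.php")` with the container running should return the HTML that the Lua script produces; the 5-second wait in the script may need adjusting if the report loads slowly.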

You can integrate Splash with Scrapy using scrapyjs (a.k.a. scrapy-splash).

See https://stackoverflow.com/a/35851072/ for an example of how to do this with a custom script.
