Question

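I want to submit a form using Mechanize (with Python), but unfortunately the page is badly coded and the <select> element is not actually inside a <form> tag.

So I can't use the traditional approach of going through the form: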

    forms = [f for f in br.forms()]
    mycontrol = forms[1].controls[0]

What should I do?

Here is the page I would like to scrape, along with the relevant markup. I am interested in the la select element:

    <fieldset class="searchField">
      <label>By region / local authority</label>
      <p id="regp">
        <label>Region</label>
        <select id="region" name="region"><option></option></select>
      </p>
      <p id="lap">
        <label>Local authority</label>
        <select id="la" name="la"><option></option></select>
      </p>
      <input id="byarea" type="submit" value="Go" />
      <img id="regmap" src="/schools/performance/img/map_england.png" alt="Map of regions in England" border="0" usemap="#England" />
    </fieldset>

Answer 1

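This is actually more complicated than you think, but still easy to implement. What is happening is that the page you linked pulls in the local authorities via JSON, which is why the name="la" select element is never populated under Mechanize: Mechanize does not run JavaScript. The easiest way around this is to request that JSON data directly with Python and use the results to go straight to each data page.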

    import urllib2  # Python 2 standard library
    import json

    # The URL where we get our array of LA data
    GET_LAS = 'http://www.education.gov.uk/cgi-bin/schools/performance/getareas.pl?level=la&code=0'

    # The URL which we interpolate the LA ID into to get individual pages
    GET_URL = 'http://www.education.gov.uk/schools/performance/geo/la%s_all.html'

    def get_performance(la):
        """Fetch the performance page for one LA ID and return its HTML."""
        page = urllib2.urlopen(GET_URL % la)
        return page.read()

    # Get the local authority list
    las = json.loads(urllib2.urlopen(GET_LAS).read())

    for la in las:
        # Skip any entry equal to 0; the rest are indexed as [ID, name]
        if la != 0:
            print('Processing LA ID #%s (%s)' % (la[0], la[1]))
            html = get_performance(la[0])
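
The listing above targets Python 2, where urllib2 exists. On Python 3 a rough equivalent might look like the sketch below. It assumes the endpoint still returns a JSON array whose entries are [ID, name] pairs mixed with possible 0 entries, as the loop above implies. The URLs are copied from above and may no longer be live, and the helper names parse_las, la_url, fetch and main are my own.

```python
import json
from urllib.request import urlopen

# URLs copied from the original answer; they may no longer respond
GET_LAS = 'http://www.education.gov.uk/cgi-bin/schools/performance/getareas.pl?level=la&code=0'
GET_URL = 'http://www.education.gov.uk/schools/performance/geo/la%s_all.html'


def parse_las(raw):
    """Decode the JSON payload and drop entries equal to 0 (hypothetical helper)."""
    return [(la[0], la[1]) for la in json.loads(raw) if la != 0]


def la_url(la_id):
    """Build the per-authority page URL from an LA ID."""
    return GET_URL % la_id


def fetch(url):
    """Download a URL and return the body as text (assumes UTF-8 or close to it)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def main():
    """Walk every local authority and download its page."""
    for la_id, name in parse_las(fetch(GET_LAS)):
        print('Processing LA ID #%s (%s)' % (la_id, name))
        html = fetch(la_url(la_id))  # hand this to whatever parser you choose
```

Nothing touches the network until you call main(), so parse_las and la_url can be tried on their own first.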

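As you can see, you don't even need to load the page you linked or use Mechanize to do this! However, you will still need a way to parse out the school names, and then the performance figures.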
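
For that parsing step, one option that needs nothing beyond the standard library is html.parser (Python 3). The sketch below collects the text of every table row. It assumes the performance pages list schools in ordinary <table> rows, which I have not verified. The names TableRowParser and extract_rows and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read, None outside <tr>
        self._cell = None  # text fragments of the current cell, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
            self._cell = None
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <a> still lands in the open cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return every table row in the HTML as a list of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented sample markup, not taken from the real site
sample = ('<table><tr><th>School</th><th>Score</th></tr>'
          '<tr><td><a href="#">Example School</a></td><td>75%</td></tr></table>')
print(extract_rows(sample))
# → [['School', 'Score'], ['Example School', '75%']]
```

If the real pages turn out to use different markup, only the tag names checked in the handlers should need to change.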

Using Mechanize with a select field that is not inside a form?
