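I have to scrape all the options provided here. Using mechanize I selected the first two controls (report type and language). After that there are three drop-down lists: the second depends on the first, and the third depends on the second. How do I handle this? My starting code for the first two fields is shown below: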
import mechanize
from bs4 import BeautifulSoup
br = mechanize.Browser()
url="http://ceojk.nic.in/ElectionPDF/Main.aspx"
response = br.open(url)
br.select_form(name="Form1")
control_1 = br.form.find_control("RadioButtonList1")
control_2 = br.form.find_control("RadioButtonList2")
submit = br.form.find_control("Button1")
br[control_1.name]=["PS Wise Report"]
br[control_2.name]=["English"]
response = br.submit()
soup=BeautifulSoup(response,'lxml')
for item in soup.find_all('option'):
    print(item['value'])
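Answer 0 (score: 1)
OK, this was really exciting to debug (you can't imagine how many things I tried and learned while trying to solve it).
Here is working code that mimics the browser's behavior, selecting the first district, AC and PS step by step (it just passes the ["1"] value; you may want to improve it, e.g. read the options and build an option name -> value map):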
import mechanize
from bs4 import BeautifulSoup
br = mechanize.Browser()
url = "http://ceojk.nic.in/ElectionPDF/Main.aspx"
response = br.open(url)
br.select_form(name="Form1")
br["RadioButtonList1"] = ["PS Wise Report"]
br["RadioButtonList2"] = ["English"]
br.submit()
# getting ACs
br.select_form(name="Form1")
br["DistlistP"] = ["1"]
br.submit(name="BtnPs")
# getting PSes
br.select_form(name="Form1")
br["AclistP"] = ["1"]
br.submit(name="BtnPs")
# getting report
br.select_form(name="Form1")
br["PslistP"] = ["1"]
response = br.submit(name="BtnPs")
soup = BeautifulSoup(response, "lxml")
print(soup.find(id="Pnlfile"))
Finally, it prints the HTML of the "files" block that appears on the right side of the page in the browser.
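The answer suggests reading the options and building an option name -> value map instead of hard-coding ["1"]. Below is a minimal sketch of that idea. It uses the standard library's html.parser rather than BeautifulSoup so that it runs with no extra installs; the OptionMapParser class, the option_map helper, and the sample markup are all made up for illustration, and the real page's option texts and values will differ.

```python
from html.parser import HTMLParser


class OptionMapParser(HTMLParser):
    """Collect {visible option text: value} for one named <select>."""

    def __init__(self, select_name):
        super().__init__()
        self.select_name = select_name
        self.options = {}
        self._in_select = False
        self._value = None  # value of the <option> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            self._in_select = attrs.get("name") == self.select_name
        elif tag == "option" and self._in_select:
            self._close_option()  # tolerate a missing </option>
            self._value = attrs.get("value")
            self._text = []

    def handle_data(self, data):
        if self._value is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "option":
            self._close_option()
        elif tag == "select":
            self._close_option()
            self._in_select = False

    def _close_option(self):
        if self._value is not None:
            self.options["".join(self._text).strip()] = self._value
            self._value = None


def option_map(html, select_name):
    """Return {option text: option value} for the <select> with that name."""
    parser = OptionMapParser(select_name)
    parser.feed(html)
    parser.close()
    return parser.options


# Made-up sample markup standing in for the real page (an assumption).
sample = """
<select name="DistlistP">
  <option value="0">--Select--</option>
  <option value="1">District A</option>
  <option value="2">District B</option>
</select>
<select name="AclistP"><option value="7">Other</option></select>
"""
districts = option_map(sample, "DistlistP")
print(districts["District A"])
```

With a map like this, a line such as br["DistlistP"] = [districts["District A"]] could replace the hard-coded ["1"], and iterating over the map's values would be one way to walk every district, then every AC, then every PS.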