I am currently trying to get more involved in programming and Python. For a small project I want to build a web scraper for a website. I have read up on Scrapy and BeautifulSoup, and so far so good.
It is a simple website with a dropdown menu of available options. If I select one of them, the site URL does not change; only the underlying HTML changes. When a value is selected, you get a result table with some columns/rows in this format:
<div id="result">
<table class="table">
<thead>
<tr>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
<th>E</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>...</b></td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
...
more follows here...
I want to fetch all the data of the result table for each of the dropdown menu entries. So far I have only managed to extract the dropdown menu values.
How can I actively select a value in the dropdown menu so that the site's HTML changes and the desired table I want to scrape from appears? After reading the Scrapy and BeautifulSoup documentation, I still do not understand that part.
from bs4 import BeautifulSoup
import requests

BASE_URL = "http://routerpasswords.com/"

def get_router_types(url):
    r = requests.get(url)
    html_content = r.content
    # pass a parser explicitly to avoid the "no parser specified" warning
    soup = BeautifulSoup(html_content, "html.parser")
    print("option values: \n")
    option_values = soup.find_all("option")
    print(option_values)
    print(" \n")
    print("router types: \n")
    router_types = [option.get('value') for option in soup.find_all('option')]
    print(router_types)
    return router_types

'''
Stuck here!
...
def get_passwords():
    router_types = get_router_types(BASE_URL)
    passwords = []
    for types in router_types:
        #print(types)
'''

def main():
    get_router_types(BASE_URL)

if __name__ == "__main__":
    main()
Answer 0 (score: 0)
Each time you click the button, you are POSTing data to the server. You can find the POST data in the Chrome dev tools (F12):
You can use requests:
data = {'findpass': '1',
        'router': 'Belkin',
        'findpassword': 'Find Password'}

r = requests.post('http://routerpasswords.com/', data=data)
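Once the POST comes back, the result table can be parsed with BeautifulSoup just like the initial page. A minimal sketch, parsing a hard-coded snippet shaped like the HTML in the question (the live response layout and column names may differ, so treat the header names below as placeholders):

```python
from bs4 import BeautifulSoup

# Hard-coded stand-in for r.content; the live page may look different.
html = """
<div id="result">
  <table class="table">
    <thead><tr><th>Manufacturer</th><th>Model</th></tr></thead>
    <tbody><tr><td><b>Belkin</b></td><td>F5D6130</td></tr></tbody>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
result = soup.find("div", {"id": "result"})

# Header cells first, then one list of cell texts per body row.
headers = [th.get_text(strip=True) for th in result.find_all("th")]
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in result.find("tbody").find_all("tr")]

print(headers)  # ['Manufacturer', 'Model']
print(rows)     # [['Belkin', 'F5D6130']]
```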
Answer 1 (score: 0)
First I fetch all the router names into a list,
then for each router I make a new request with the correct POST params (def: get_passwords_via_name):
from bs4 import BeautifulSoup
import requests

BASE_URL = "http://routerpasswords.com/"

def get_router_types(url):
    r = requests.get(url)
    html_content = r.content
    soup = BeautifulSoup(html_content, "html.parser")
    print("option values: \n")
    option_values = soup.find_all("option")
    print(option_values)
    print(" \n")
    print("router types: \n")
    router_types = [option.get('value') for option in soup.find_all('option')]
    return router_types, r

def get_passwords_via_name(router_name, rcookie):
    # requests URL-encodes form data itself, so use a plain space here,
    # not the already-encoded "Find+Password"
    data = {"findpass": "1", "router": router_name, "findpassword": "Find Password"}
    print(data)
    c = requests.post('http://routerpasswords.com/', data=data)
    print(c.url)
    html_content = c.content
    print(c.status_code)
    soup = BeautifulSoup(html_content, "html.parser")
    return soup.find("div", {"id": "result"})

def main():
    rlist, r = get_router_types(BASE_URL)
    for i in rlist:
        print("debug")
        print(get_passwords_via_name(i, r))

if __name__ == "__main__":
    main()
The curl way:
curl 'http://routerpasswords.com/' --data 'findpass=1&router=ZyXEL&findpassword=Find+Password'
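A generic way to confirm which POST key the dropdown submits is to read the page's form markup itself: the key is the select element's name attribute, and the option values are the candidate payloads. A minimal sketch against a hard-coded form (the field names here are assumptions inferred from the POST data used in both answers; check the real page's source):

```python
from bs4 import BeautifulSoup

# Hard-coded stand-in for the page's form; names are assumed, not verified.
html = """
<form method="post">
  <input type="hidden" name="findpass" value="1">
  <select name="router">
    <option value="Belkin">Belkin</option>
    <option value="ZyXEL">ZyXEL</option>
  </select>
  <input type="submit" name="findpassword" value="Find Password">
</form>
"""

soup = BeautifulSoup(html, "html.parser")

# The select's name attribute is the POST key for the dropdown.
post_key = soup.find("select").get("name")
# Each option's value attribute is one submittable choice.
choices = [o.get("value") for o in soup.find_all("option")]

print(post_key)  # router
print(choices)   # ['Belkin', 'ZyXEL']
```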