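I am trying to scrape a table based on the values in dropdown lists on a web page that has several of them (the page requires a login, so I can't post it here).
There are three dropdowns: state, muni, and year. That means the number of tables I need to loop over and scrape is large: state * muni * year.
I want to loop over the states: take the first state, select its first municipality, and scrape the tables for all years.
Then, staying on the same (first) state, select the next (second) municipality and scrape the tables for all years: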
state(1), muni(1), year(all)
state(1), muni(2), year(all)
...
state(last), muni(last), year(all)
Pseudocode:
for i in each unique state:
    select each muni
    for j in each muni:
        scrape each table from each year j in a year list
        append the year list in the muni list in a state list
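So far I have the code below, but it keeps repeating the years for the first state and municipality indefinitely and never moves on to the next state. Do you have any tips on how to fix this? Any help is appreciated.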
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
muni = []
year = []
data = []
for i in state:
    select_state = Select(browser.find_element_by_class_name("lists-landingpage--navigation-regionSelector"))
    select_state.select_by_value(i)
    options_muni = browser.find_element_by_class_name("lists-landingpage--navigation-subRegionSelector")
    options_muni = options_muni.find_elements_by_tag_name('option')
    for j in options_muni:
        muni.append(j.get_attribute("value"))
    for k in muni:
        select_muni = Select(browser.find_element_by_class_name("lists-landingpage--navigation-subRegionSelector"))
        select_muni.select_by_value(k)
        options_year = browser.find_element_by_class_name("lists-landingpage--navigation-yearSelector")
        options_year = options_year.find_elements_by_tag_name('option')
        for n in options_year:
            year.append(n.get_attribute("value"))
            table = soup.find('div', attrs = {'class': 'lists-landingpage--body'})
            table_body = table.find('tbody')
            rows = table_body.find_all('tr')
            for row in rows:
                cols = row.find_all('td')
                cols = [ele.text.strip() for ele in cols]
                data.append([ele for ele in cols if ele])
How can I append them to a list (year) inside a list (muni) inside a list (state)?
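The nested shape asked about here can be sketched without a browser. This is only a minimal illustration: the `site` dictionary and the `scrape_table` function are hypothetical stand-ins for the three dropdowns and for the real table scrape. The point it shows is that the inner lists are created fresh inside each loop rather than once at the top of the script.

```python
# Hypothetical stand-in for the three dropdowns: state -> muni -> years.
# On the real page these values would be read from the <select> elements.
site = {
    "state1": {"muni1": ["2019", "2020"], "muni2": ["2020"]},
    "state2": {"muni3": ["2018", "2019"]},
}


def scrape_table(state, muni, year):
    """Placeholder for the real table scrape; returns a list of rows."""
    return [[state, muni, year]]


all_states = []
for state, munis in site.items():
    state_list = []                  # fresh list for every state
    for muni, years in munis.items():
        muni_list = []               # fresh list for every muni
        for year in years:
            muni_list.append(scrape_table(state, muni, year))
        state_list.append(muni_list)
    all_states.append(state_list)

# Index order is all_states[state][muni][year].
print(all_states[0][1][0])
```

With this layout the state, muni, and year are only recoverable by position, which is one reason a keyed structure may be easier to work with.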
Answer 0 (score: 2)
You can turn it into a list of dictionaries:
all_data = []
for i in state:
    for j in options_muni:
        # one record per (state, muni) pair
        values = {'state': i, 'muni': j, 'years': []}
        for n in options_year:
            values['years'].append(n)
        all_data.append(values)
Example:
states = ['state1', 'state2', 'state3']
munis = ['muni1', 'muni2', 'muni3']
years = ['year1', 'year2', 'year3']
will output
{'state': 'state1', 'muni': 'muni1', 'years': ['year1', 'year2', 'year3']}
{'state': 'state1', 'muni': 'muni2', 'years': ['year1', 'year2', 'year3']}
{'state': 'state1', 'muni': 'muni3', 'years': ['year1', 'year2', 'year3']}
{'state': 'state2', 'muni': 'muni1', 'years': ['year1', 'year2', 'year3']}
{'state': 'state2', 'muni': 'muni2', 'years': ['year1', 'year2', 'year3']}
{'state': 'state2', 'muni': 'muni3', 'years': ['year1', 'year2', 'year3']}
{'state': 'state3', 'muni': 'muni1', 'years': ['year1', 'year2', 'year3']}
{'state': 'state3', 'muni': 'muni2', 'years': ['year1', 'year2', 'year3']}
{'state': 'state3', 'muni': 'muni3', 'years': ['year1', 'year2', 'year3']}
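In the question's code, the `muni` and `year` lists are created once before the loops, and `soup` is parsed from a page source captured before any dropdown is changed. Both may contribute to the repeated results. The sketch below combines the list-of-dictionaries idea with re-reading the dependent options on every iteration. The `collect` function and its three callables (`get_munis`, `get_years`, `get_rows`) are hypothetical names introduced here so the example runs without a browser; with Selenium they would wrap the dropdown selection and a fresh parse of the page.

```python
def collect(states, get_munis, get_years, get_rows):
    """Build one record per (state, muni) pair.

    get_munis, get_years and get_rows are hypothetical callables. With
    Selenium they would select the dropdown value, re-read the dependent
    <option> elements, and parse a freshly fetched page source.
    """
    all_data = []
    for state in states:
        for muni in get_munis(state):              # re-queried for each state
            record = {"state": state, "muni": muni, "years": {}}
            for year in get_years(state, muni):    # re-queried for each muni
                record["years"][year] = get_rows(state, muni, year)
            all_data.append(record)
    return all_data


# Fake page data so the sketch runs without a browser.
fake_munis = {"state1": ["muni1", "muni2"], "state2": ["muni3"]}

result = collect(
    states=["state1", "state2"],
    get_munis=lambda s: fake_munis[s],
    get_years=lambda s, m: ["2019", "2020"],
    get_rows=lambda s, m, y: [[s, m, y, "value"]],
)

for record in result:
    print(record["state"], record["muni"], sorted(record["years"]))
```

Storing the rows under each year key, instead of only listing the year names, keeps every scraped table attached to the state, muni, and year it came from.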