我正在尝试抓取此页面(“ http://www.arohan.in/branch-locator.php”),在该页面中,当我选择州和城市时,将显示一个地址,并且我必须在csv / excel文件中写入州,城市和地址。我可以坚持到这一步,现在我被困住了。
这是我的代码:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
chrome_path= r"C:\Users\IBM_ADMIN\Downloads\chromedriver_win32\chromedriver.exe"
driver =webdriver.Chrome(chrome_path)
driver.get("http://www.arohan.in/branch-locator.php")
select = Select(driver.find_element_by_name('state'))
select.select_by_visible_text('Bihar')
drop = Select(driver.find_element_by_name('branch'))
city_option = WebDriverWait(driver, 5).until(lambda x: x.find_element_by_xpath("//select[@id='city1']/option[text()='Gaya']"))
city_option.click()
答案 0 :(得分:2)
硒是必需的吗?看起来您可以使用URL来获得所需的内容:http://www.arohan.in/branch-locator.php?state=Assam&branch=Mirza。
获取状态/分支组合的列表,然后使用精美的汤教程来获取每个页面的信息。
答案 1 :(得分:1)
以一种有条理的方式:
import requests
from bs4 import BeautifulSoup
link = "http://www.arohan.in/branch-locator.php?"
def get_links(session,url,payload):
session.headers["User-Agent"] = "Mozilla/5.0"
res = session.get(url,params=payload)
soup = BeautifulSoup(res.text,"lxml")
item = [item.text for item in soup.select(".address_area p")]
print(item)
if __name__ == '__main__':
for st,br in zip(['Bihar','West Bengal'],['Gaya','Kolkata']):
payload = {
'state':st ,
'branch':br
}
with requests.Session() as session:
get_links(session,link,payload)
输出:
['Branch', 'House no -10/12, Ward-18, Holding No-12, Swarajpuri Road, Near Bank of Baroda, Gaya Pin 823001(Bihar)', 'N/A', 'N/A']
['Head Office', 'PTI Building, 4th Floor, DP Block, DP-9, Salt Lake City Calcutta, 700091', '+91 33 40156000', 'contact@arohan.in']
答案 2 :(得分:0)
更好的方法是避免使用硒。如果您需要呈现HTML所需的javascript处理,这将很有用。在您的情况下,这是不需要的。所需信息已包含在HTML中。
所需的是首先发出请求以获取包含所有状态的页面。然后针对每个状态,请求分支列表。然后,对于每个状态/分支组合,可以发出URL请求以获取包含地址的HTML。这恰好包含在<li>
条目之后的第二个<ul class='address_area'>
条目中:
from bs4 import BeautifulSoup
import requests
import csv
import time
# Get a list of available states
r = requests.get('http://www.arohan.in/branch-locator.php')
soup = BeautifulSoup(r.text, 'html.parser')
state_select = soup.find('select', id='state1')
states = [option.text for option in state_select.find_all('option')[1:]]
# Open an output CSV file
with open('branch addresses.csv', 'w', newline='', encoding='utf-8') as f_output:
csv_output = csv.writer(f_output)
csv_output.writerow(['State', 'Branch', 'Address'])
# For each state determine the available branches
for state in states:
r_branches = requests.post('http://www.arohan.in/Ajax/ajax_branch.php', data={'ajax_state':state})
soup = BeautifulSoup(r_branches.text, 'html.parser')
# For each branch, request a page contain the address
for option in soup.find_all('option')[1:]:
time.sleep(0.5) # Reduce server loading
branch = option.text
print("{}, {}".format(state, branch))
r_branch = requests.get('http://www.arohan.in/branch-locator.php', params={'state':state, 'branch':branch})
soup_branch = BeautifulSoup(r_branch.text, 'html.parser')
ul = soup_branch.find('ul', class_='address_area')
if ul:
address = ul.find_all('li')[1].get_text(strip=True)
row = [state, branch, address]
csv_output.writerow(row)
else:
print(soup_branch.title)
开始为您提供输出CSV文件:
State,Branch,Address
West Bengal,Kolkata,"PTI Building, 4th Floor,DP Block, DP-9, Salt Lake CityCalcutta, 700091"
West Bengal,Maheshtala,"Narmada Park, Par Bangla,Baddir Bandh Bus Stop,Opp Lane Kismat Nungi Road,Maheshtala,Kolkata- 700140. (W.B)"
West Bengal,ShyamBazar,"First Floor, 6 F.b.T. Road,Ward No.-6,Kolkata-700002"
您应该使用time.sleep(0.5)
来降低脚本速度,以避免服务器上的负载过多。
注意:[1:]
用作下拉列表中的第一项不是分支或状态,而是Select Branch
条目。