我正在尝试从一个网站上将数据抓取到CSV文件中,该网站列出了我所在行业的联系人信息。在进入其中一个条目没有特定项目的页面之前,我的代码才能正常工作。
例如:
我正在尝试收集
姓名,电话,个人资料网址
如果没有列出电话号码,则页面上甚至没有该字段的标签,并且我的代码出错了
“ IndexError:列表索引超出范围”
对此我还很陌生,但是到目前为止,我设法从各种youtube教程/这个网站上凑了下来,确实为我节省了很多时间来完成某些工作,而这些工作本来需要几天的时间。任何人愿意提供的帮助,我将不胜感激。
我尝试过更改if / then语句,如果变量为null,则将变量设置为“ Empty”
我更新了代码。我改用CSS选择器以获得更多的特异性和可读性。我还添加了一个try / except来至少绕过索引错误,但是并不能解决由于每个字段的数据量不均匀而导致存储错误数据的问题。另外,我要抓取的网站现在在代码中。
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
driver = webdriver.Firefox()
MAX_PAGE_NUM = 5
MAX_PAGE_DIG = 2
with open('results.csv', 'w') as f:
f.write("Name, Number, URL \n")
#Run Through Pages
for i in range(1, MAX_PAGE_NUM + 1):
page_num = (MAX_PAGE_DIG - len(str(i))) * "0" + str(i)
website = "https://www.realtor.com/realestateagents/lansing_mi/pg-" + page_num
driver.get(website)
Name = driver.find_elements_by_css_selector('div.agent-list-card-title-text.clearfix > div.agent-name.text-bold > a')
Number = driver.find_elements_by_css_selector('div.agent-list-card-title-text.clearfix > div.agent-phone.hidden-xs.hidden-xxs')
URL = driver.find_elements_by_css_selector('div.agent-list-card-title-text.clearfix > div.agent-name.text-bold > a')
#Collect Data From Each Page
num_page_items = len(Name)
with open('results.csv', 'a') as f:
for i in range(num_page_items):
try:
f.write(Name[i].text.replace(",", ".") + "," + Number[i].text + "," + URL[i].get_attribute('href') + "\n")
print(Name[i].text.replace(",", ".") + "," + Number[i].text + "," + URL[i].get_attribute('href') + "\n")
except IndexError:
f.write("Skip, Skip, Skip \n")
print("Number Missing")
continue
driver.close()
如果我要收集的任何字段在单个列表中都不存在,我只希望在电子表格中将空字段填写为“空”。
答案 0 :(得分:1)
您可以使用try / except来解决。我也选择使用Pandas和BeautifulSoup,因为我对它们更加熟悉。
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
import pandas as pd
MAX_PAGE_NUM = 5
MAX_PAGE_DIG = 2
results = pd.DataFrame()
#Run Through Pages
for i in range(1, MAX_PAGE_NUM + 1):
page_num = (MAX_PAGE_DIG - len(str(i))) * "0" + str(i)
website = "https://www.realtor.com/realestateagents/lansing_mi/pg-" + page_num
driver.get(website)
soup = BeautifulSoup(driver.page_source, 'html.parser')
agent_cards = soup.find_all('div', {'class':'agent-list-card clearfix'})
for agent in agent_cards:
try:
Name = agent.find('div', {'itemprop':'name'}).text.strip().split('\n')[0]
except:
Name = None
try:
Number = agent.find('div', {'itemprop':'telephone'}).text.strip()
except:
Number = None
try:
URL = 'https://www.realtor.com/' + agent.find('a', href=True)['href']
except:
URL = None
temp_df = pd.DataFrame([[Name, Number, URL]], columns=['Name','Number','URL'])
results = results.append(temp_df, sort=True).reset_index(drop=True)
print('Processed page: %s' %i)
driver.close()
results.to_csv('results.csv', index=False)
输出:
print (results)
Name ... URL
0 Nicole Enz ... https://www.realtor.com//realestateagents/nico...
1 Jennifer Worthington ... https://www.realtor.com//realestateagents/jenn...
2 Katherine Keener ... https://www.realtor.com//realestateagents/kath...
3 Erica Cook ... https://www.realtor.com//realestateagents/eric...
4 Jeff Thornton, Broker, Assoc Broker ... https://www.realtor.com//realestateagents/jeff...
5 Neal Sanford, Agent ... https://www.realtor.com//realestateagents/neal...
6 Sherree Zea ... https://www.realtor.com//realestateagents/sher...
7 Jennifer Cooper ... https://www.realtor.com//realestateagents/jenn...
8 Charlyn Cosgrove ... https://www.realtor.com//realestateagents/char...
9 Kathy Birchen & Chad Dutcher ... https://www.realtor.com//realestateagents/kath...
10 Nancy Petroff ... https://www.realtor.com//realestateagents/nanc...
11 The Angela Averill Team ... https://www.realtor.com//realestateagents/the-...
12 Christina Tamburino ... https://www.realtor.com//realestateagents/chri...
13 Rayce O'Connell ... https://www.realtor.com//realestateagents/rayc...
14 Stephanie Morey ... https://www.realtor.com//realestateagents/step...
15 Sean Gardner ... https://www.realtor.com//realestateagents/sean...
16 John Burg ... https://www.realtor.com//realestateagents/john...
17 Linda Ellsworth-Moore ... https://www.realtor.com//realestateagents/lind...
18 David Bueche ... https://www.realtor.com//realestateagents/davi...
19 David Ledebuhr ... https://www.realtor.com//realestateagents/davi...
20 Aaron Fox ... https://www.realtor.com//realestateagents/aaro...
21 Kristy Seibold ... https://www.realtor.com//realestateagents/kris...
22 Genia Beckman ... https://www.realtor.com//realestateagents/geni...
23 Angela Bolan ... https://www.realtor.com//realestateagents/ange...
24 Constance Benca ... https://www.realtor.com//realestateagents/cons...
25 Lisa Fata ... https://www.realtor.com//realestateagents/lisa...
26 Mike Dedman ... https://www.realtor.com//realestateagents/mike...
27 Jamie Masarik ... https://www.realtor.com//realestateagents/jami...
28 Amy Yaroch ... https://www.realtor.com//realestateagents/amy-...
29 Debbie McCarthy ... https://www.realtor.com//realestateagents/debb...
.. ... ... ...
70 Vickie Blattner ... https://www.realtor.com//realestateagents/vick...
71 Faith F Steller ... https://www.realtor.com//realestateagents/fait...
72 A. Jason Titus ... https://www.realtor.com//realestateagents/a.--...
73 Matt Bunn ... https://www.realtor.com//realestateagents/matt...
74 Joe Vitale ... https://www.realtor.com//realestateagents/joe-...
75 Reozom Real Estate ... https://www.realtor.com//realestateagents/reoz...
76 Shane Broyles ... https://www.realtor.com//realestateagents/shan...
77 Megan Doyle-Busque ... https://www.realtor.com//realestateagents/mega...
78 Linda Holmes ... https://www.realtor.com//realestateagents/lind...
79 Jeff Burke ... https://www.realtor.com//realestateagents/jeff...
80 Jim Convissor ... https://www.realtor.com//realestateagents/jim-...
81 Concetta D'Agostino ... https://www.realtor.com//realestateagents/conc...
82 Melanie McNamara ... https://www.realtor.com//realestateagents/mela...
83 Julie Adams ... https://www.realtor.com//realestateagents/juli...
84 Liz Horford ... https://www.realtor.com//realestateagents/liz-...
85 Miriam Olsen ... https://www.realtor.com//realestateagents/miri...
86 Wanda Williams ... https://www.realtor.com//realestateagents/wand...
87 Troy Seyfert ... https://www.realtor.com//realestateagents/troy...
88 Maggie Gerich ... https://www.realtor.com//realestateagents/magg...
89 Laura Farhat Bramson ... https://www.realtor.com//realestateagents/laur...
90 Peter MacIntyre ... https://www.realtor.com//realestateagents/pete...
91 Mark Jacobsen ... https://www.realtor.com//realestateagents/mark...
92 Deb Good ... https://www.realtor.com//realestateagents/deb-...
93 Mary Jane Vanderstow ... https://www.realtor.com//realestateagents/mary...
94 Ben Magsig ... https://www.realtor.com//realestateagents/ben-...
95 Brenna Chamberlain ... https://www.realtor.com//realestateagents/bren...
96 Deborah Cooper, CNS ... https://www.realtor.com//realestateagents/debo...
97 Huggler, Bashore & Brooks ... https://www.realtor.com//realestateagents/hugg...
98 Jodey Shepardson Custack ... https://www.realtor.com//realestateagents/jode...
99 Madaline Alspaugh-Young ... https://www.realtor.com//realestateagents/mada...
[100 rows x 3 columns]