I have code that iteratively changes a URL, passes each URL to a Selenium driver to fetch the HTML, and then feeds the page source into BeautifulSoup for processing. My problem is that I randomly get the following error (it happens on different pages each run and crashes the program; no single page consistently causes the failure):
Traceback (most recent call last):
  File "scrape.py", line 89, in <module>
    i, i + 5000)
  File "scrape.py", line 37, in scrapeWebsite
    extractedInfo = info.findAll("td")
AttributeError: 'NoneType' object has no attribute 'findAll'
The i, i + 5000 is just used by the loop to step through the pages, so it isn't relevant here.
Here is the code that does the HTML scraping:
driver = webdriver.Chrome(executable_path='/Users/Downloads/chromedriver')
print(start, stop)
madeDict = {"Date": [], "Team": [], "Name": [], "Relinquished": [], "Notes": []}

# for i in range(0, 214025, 25):
for i in range(start, stop, 25):
    print("Current Page: " + str(i))
    currUrl = url + str(i)
    driver.get(currUrl)
    driver.implicitly_wait(100)
    soupPage = BeautifulSoup(driver.page_source, 'html.parser')
    # page = urllib2.urlopen(currUrl)
    # soupPage = BeautifulSoup(page, 'html.parser')
    # Sleep the program to ensure page is fully loaded
    # time.sleep(1)
    info = soupPage.find("table", attrs={'class': 'datatable center'})
    extractedInfo = info.findAll("td")
My guess is that the page sometimes hasn't finished loading, so when the code tries to grab the content the table tag isn't there yet. However, I thought Selenium handled dynamically loaded pages and would ensure the page was fully loaded before BeautifulSoup scraped it. I have read other posts saying I need to make the program wait for the page to load dynamically, but I tried that and still hit the same error.
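Whatever the underlying cause, the crash itself comes from calling findAll on the None that soupPage.find() returns when the table is absent. One defensive pattern is to retry the fetch-and-parse step until the table shows up. A minimal sketch of that retry logic, where fetch_page and parse_table are hypothetical stand-ins for the driver.get/page_source fetch and the BeautifulSoup .find() call:

```python
import time

def find_table_with_retry(fetch_page, parse_table, retries=3, delay=1.0):
    """Repeatedly fetch and parse until the table is found.

    fetch_page() returns the page HTML; parse_table(html) returns the
    table element, or None when the tag is absent (as .find() does).
    Raises RuntimeError if the table never appears.
    """
    for attempt in range(retries):
        table = parse_table(fetch_page())
        if table is not None:       # .find() returns None when the tag is missing
            return table
        time.sleep(delay)           # give the page time to finish loading
    raise RuntimeError("table not found after %d attempts" % retries)
```

In the loop above, the two lines that build `info` and `extractedInfo` would be replaced by one call to this helper, so a transient missing table triggers a re-fetch instead of an AttributeError.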
Answer 0 (score: 0)
Don't use Selenium for this; use requests instead.
import requests
from bs4 import BeautifulSoup

url = 'https://www.prosportstransactions.com/football/Search/SearchResults.php?Player=&Team=&BeginDate=&EndDate=&PlayerMovementChkBx=yes&submit=Search&start='

for i in range(0, 214025, 25):
    print("Current Page: " + str(i))
    r = requests.get(url + str(i))
    soup = BeautifulSoup(r.content, 'html.parser')
    info = soup.find("table", attrs={'class': 'datatable center'})
    extractedInfo = info.findAll("td")
    print(extractedInfo)
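As an aside, if BeautifulSoup were unavailable, the same "collect every &lt;td&gt; cell" step could be sketched with the standard library's html.parser module. This is an illustrative stand-in, not the answer author's code:

```python
from html.parser import HTMLParser

class TdExtractor(HTMLParser):
    """Collect the text content of every <td> cell in an HTML document."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")   # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data  # accumulate text inside the current cell

# Example usage on a tiny table:
parser = TdExtractor()
parser.feed("<table><tr><td>2020-01-01</td><td>Team A</td></tr></table>")
print(parser.cells)  # -> ['2020-01-01', 'Team A']
```

BeautifulSoup's findAll("td") is far more convenient for real pages, but the stdlib version avoids the extra dependency.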