My code currently produces the following output while scraping: https://pastebin.com/pUcCdbMn.
I want to get the text inside listing-title, i.e. for
<h2 class="listing-title"><a class="listing-fpa-link" href="...">Vauxhall Astra 1.6i 16V Design 5dr Hatchback</a></h2>
return Vauxhall Astra 1.6i 16V Design 5dr Hatchback,
and the text inside listing-key-specs, i.e. for
<ul class="listing-key-specs">
<li>2015
(65 reg)</li>
<li>Hatchback</li>
<li>14,304 miles</li>
<li>Manual</li>
<li>1.6L</li>
<li>Petrol</li>
</ul>
return 2015 (65 reg), Hatchback, 14,304 miles, Manual, 1.6L and Petrol, each as a separate variable.
How can I do this? When I try to extract the listing title, my code currently returns None:
for page in range(1, 3):
    page_count = str(page)
    if page == 1:
        url = "http://www.autotrader.co.uk/car-search?sort=sponsored&radius=1500&postcode=se218qe&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New"
    else:
        url = "http://www.autotrader.co.uk/car-search?sort=sponsored&radius=1500&postcode=se218qe&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&page=" + page_count
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    cars = soup.find_all('li', {'class': 'search-page__result'})
    cars_count = len(cars)
    print 'Processing ' + str(cars_count) + ' cars found on page ' + page_count
    # Loop through cars on page
    for car in cars:
        car_name = car.find('h2 ', {'class': 'listing-title'})
        print car_name
Answer 0 (score: 3)
You have an extra space after the tag name:
car_name = car.find('h2 ', {'class': 'listing-title'})
#                  HERE^
Remove it and it should start working as is.
Note that to get the title text, use the get_text() method:
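As a minimal sketch of why the trailing space matters: BeautifulSoup compares a string tag name exactly, so 'h2 ' matches nothing. The one-line markup and the names with_space and without_space below are made up for illustration and are not part of the original script.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only to show the lookup behaviour.
soup = BeautifulSoup('<h2 class="listing-title">Example title</h2>', "html.parser")

# Tag name with a trailing space: no tag is literally named "h2 ".
with_space = soup.find('h2 ', {'class': 'listing-title'})

# Tag name without the space: the element is found.
without_space = soup.find('h2', {'class': 'listing-title'})

print(with_space)
print(without_space)
```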
print(car_name.get_text(strip=True))
You could also replace .find() with .select_one(), which takes a CSS selector:
car_name = car.select_one('h2.listing-title')
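For the listing-key-specs part of the question, here is a rough sketch run against the sample markup from the question instead of a live page. It assumes each listing has exactly six spec items in the order shown. Real listings may have more or fewer, so the unpacking step may need a guard. The wrapping li element and the variable names (year, body, mileage, gearbox, engine, fuel) are illustrative choices.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; the outer <li> mirrors the
# 'search-page__result' elements the question's loop iterates over.
html = """
<li class="search-page__result">
<h2 class="listing-title"><a class="listing-fpa-link" href="...">Vauxhall Astra 1.6i 16V Design 5dr Hatchback</a></h2>
<ul class="listing-key-specs">
<li>2015
(65 reg)</li>
<li>Hatchback</li>
<li>14,304 miles</li>
<li>Manual</li>
<li>1.6L</li>
<li>Petrol</li>
</ul>
</li>
"""

car = BeautifulSoup(html, "html.parser").select_one("li.search-page__result")

# Title text, with surrounding whitespace stripped.
title = car.select_one("h2.listing-title").get_text(strip=True)

# Collapse internal whitespace so "2015\n(65 reg)" becomes "2015 (65 reg)".
specs = [" ".join(li.get_text().split())
         for li in car.select("ul.listing-key-specs li")]

# Assumption: exactly six items, in this order.
year, body, mileage, gearbox, engine, fuel = specs

print(title)
print(year, body, mileage, gearbox, engine, fuel)
```

If the number of spec items varies between listings, keeping them in the specs list, or checking len(specs) before unpacking, would avoid a ValueError.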
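I would also make the script more reliable by explicitly waiting for the search results to appear before reading the page source and passing it on for further parsing: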
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
# ...
browser.get(url)
wait = WebDriverWait(browser, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".search-page__result .listing-title")))
soup = BeautifulSoup(browser.page_source, "html.parser")