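The page I am scraping is dynamic and has a "Load More" button, so I am using Selenium. The first problem is that it only works once: the "Load More" button is clicked only the first time. The second problem is that it only scrapes the articles listed before the first "Load More" click and nothing after that. The third problem is that it scrapes every article twice. The fourth problem is that I only want the date, but I get the date, author, and location together.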
import time
import requests
from bs4 import BeautifulSoup
from bs4.element import Tag
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

base = "https://indianexpress.com"
browser = webdriver.Safari(executable_path='/usr/bin/safaridriver')
wait = WebDriverWait(browser, 10)
browser.get('https://indianexpress.com/?s=cybersecurity')

while True:
    try:
        time.sleep(6)
        show_more = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'Load More')))
        show_more.click()
    except Exception as e:
        print(e)
        break

soup = BeautifulSoup(browser.page_source,'lxml')
search_results = soup.find('div', {'id':'ie-infinite-scroll'})
links = search_results.find_all('a')
for link in links:
    link_url = link['href']
    response = requests.get(link_url)
    sauce = BeautifulSoup(response.text, 'html.parser')
    dateTag = sauce.find('div', {'class':'m-story-meta__credit'})
    titleTag = sauce.find('h1', {'class':'m-story-header__title'})
    contentTag = ' '.join([item.get_text(strip=True) for item in sauce.select("[class^='o-story-content__main a-wysiwyg'] p")])
    date = None
    title = None
    content = None
    if isinstance(dateTag, Tag):
        date = dateTag.get_text().strip()
    if isinstance(titleTag, Tag):
        title = titleTag.get_text().strip()
    print(f'{date}\n {title}\n {contentTag}\n')
    time.sleep(3)
This code runs without errors, but it needs improvement. What should I do to solve the problems above?
Thanks.
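Answer 0 (score: 1)
It happens because you are not waiting for the new content. You try to click the "Load More" button again while the new content is still loading, and at that moment a loading overlay covers the button.
Error message: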
Message: Element <a class="m-featured-link m-featured-link--centered ie-load-more" href="#"> is not clickable at point (467,417) because another element <div class="o-listing__load-more m-loading"> obscures it
My solution:
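while True:
    try:
        # Wait until the "Load More" link is clickable, then click it.
        wait.until(EC.element_to_be_clickable((By.XPATH, "//a[contains(@class, 'ie-load-more')]")))
        browser.find_element(By.XPATH, "//a[contains(@class, 'ie-load-more')]").click()
        # Wait until the wrapper's class is exactly 'o-listing__load-more' again,
        # i.e. the 'm-loading' state seen in the error message is gone.
        wait.until(EC.visibility_of_element_located((By.XPATH,"//div[@class='o-listing__load-more']")))
    except Exception as e:
        # The button is no longer clickable or a wait timed out: stop clicking.
        print(e)
        break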
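
The loop above only covers the first two problems. For the third and fourth (every article scraped twice, and the date arriving together with author and location), here is a minimal sketch of two helpers. It rests on assumptions that are not confirmed by the page itself: that the duplicates come from each result containing more than one `<a>` pointing to the same URL, and that the text of `m-story-meta__credit` contains a date written like `January 8, 2020`. The names `unique_links` and `extract_date` and the sample strings are hypothetical.

```python
import re

# Assumption: the credit line holds an English "Month D, YYYY" date.
# Adjust the pattern if the real markup uses another format.
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
DATE_RE = re.compile(rf"(?:{MONTHS}) \d{{1,2}}, \d{{4}}")


def unique_links(hrefs):
    """Drop repeated URLs while keeping first-seen order."""
    return list(dict.fromkeys(hrefs))


def extract_date(credit_text):
    """Return the first 'Month D, YYYY' date in the text, or None."""
    match = DATE_RE.search(credit_text)
    return match.group(0) if match else None


if __name__ == "__main__":
    # Made-up sample data standing in for scraped values.
    hrefs = [
        "https://example.com/a",
        "https://example.com/a",
        "https://example.com/b",
    ]
    print(unique_links(hrefs))
    print(extract_date("Written by A. Writer | New Delhi | Updated: January 8, 2020 9:50:28 am"))
```

In the question's script this could be wired in as `links = unique_links(a['href'] for a in search_results.find_all('a', href=True))`, iterating over the URLs directly, and `date = extract_date(dateTag.get_text())`. If the credit line turns out to use a different date format, only `DATE_RE` needs to change.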