Question

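I am currently writing a Python Selenium script to scrape "Likibu.com", a website that offers short-term accommodation, like Airbnb or Booking. I have successfully retrieved all the data on the first page and saved it to a CSV file, but there are 37 pages and I want to scrape the data on those pages as well. The code I have so far is as follows: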

driver.get("https://www.likibu.com/")
page = driver.page_source
soup = BeautifulSoup(page, "lxml")
driver.get("https://www.likibu.com/{0}".format(soup.find(rel=re.compile("nofollow")).attrs["href"]))

The relevant part of the page source is:

<ul class="pagination">
<li class="disabled"><a href="#">«</a></li>
<li class="active"><a class="" rel="nofollow" href="https://www.likibu.com/fr/search/39tuzgbpnycdv7tkj102g?guests=2&amp;destination_id=4094&amp;page=1">1</a></li>
<li><a class="" rel="nofollow" href="https://www.likibu.com/fr/search/39tuzgbpnycdv7tkj102g?guests=2&amp;destination_id=4094&amp;page=37">37</a></li>
<li><a class="" rel="nofollow" href="https://www.likibu.com/fr/search/39tuzgbpnycdv7tkj102g?guests=2&amp;destination_id=4094&amp;page=2">»</a></li>

Answer 1

Whenever you scrape multiple pages, you have to figure out how the URL changes from page to page. In your case:

root = 'https://www.likibu.com/fr/search/39yrzgbpnycdv7tkj132g?guests=2&page='

page_number = 0
while True:
    page_number += 1
    try:
        url = root + str(page_number)
        ### CODE #####
    except Exception:
        ### terminate / print something ####
        break

Note: I added '&page=' to the link you posted. It does not show up in the URL of the first page, but it still exists: if you append '&page=1', you get the first page.
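The URL pattern described above can also be built with the standard library rather than string concatenation. The following is only a sketch: `page_url`, `SEARCH_URL`, and `LAST_PAGE` are hypothetical names, the search URL is copied from the pagination markup in the question, and the value 37 is assumed to stay fixed while the script runs.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def page_url(base_url, page_number):
    """Return base_url with its `page` query parameter set to page_number."""
    parts = urlsplit(base_url)
    # Drop any existing `page` parameter so it is never duplicated.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "page"]
    query.append(("page", str(page_number)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Assumption: search URL and page count taken from the question's markup.
SEARCH_URL = ("https://www.likibu.com/fr/search/39tuzgbpnycdv7tkj102g"
              "?guests=2&destination_id=4094")
LAST_PAGE = 37

urls = [page_url(SEARCH_URL, n) for n in range(1, LAST_PAGE + 1)]
```

Each entry in `urls` could then be passed to `driver.get(...)` in turn, with the existing first-page scraping code run after every load.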

Answer 2

I fixed this by using a while True loop:

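import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait

while True:
    # Stop when no element containing the text "Suivant" (next) is left.
    if not driver.find_elements(By.XPATH, "//*[contains(text(), 'Suivant')]"):
        break
    link = WebDriverWait(driver, 1530).until(expected_conditions.element_to_be_clickable((By.LINK_TEXT, "Suivant")))
    link.click()
    next_page = driver.find_element(By.CSS_SELECTOR, '#pnnext')
    next_page.click()
    time.sleep(5)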
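As an alternative to clicking until the "Suivant" link disappears, the last page number could be read directly from the pagination markup shown in the question. This is a rough sketch using only the standard library; `PaginationParser` and `last_page` are hypothetical helpers, and it assumes every pagination link carries `rel="nofollow"` and a numeric `page` query parameter, as in the posted snippet.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, parse_qs


class PaginationParser(HTMLParser):
    """Collect the `page` value of every <a rel="nofollow"> link."""

    def __init__(self):
        super().__init__()
        self.pages = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag != "a" or attrs.get("rel") != "nofollow":
            return
        # HTMLParser already unescapes &amp; inside attribute values.
        query = parse_qs(urlsplit(attrs.get("href") or "").query)
        value = query.get("page", [""])[0]
        if value.isdigit():
            self.pages.append(int(value))


def last_page(html):
    """Return the highest page number linked in html, or 1 if none is found."""
    parser = PaginationParser()
    parser.feed(html)
    return max(parser.pages, default=1)
```

With Selenium, this might be called as `last_page(driver.page_source)` after loading the first results page, and the result used as the upper bound of a simple `for` loop over page numbers.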

Automating navigation to the next page using Selenium and Python
