Question

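I am trying to scrape the links to the articles on a website. When the site loads, it lists only 5 articles, and you have to click a "Load more" button to show more of the list. The HTML source contains links to only the first five articles.

I used Selenium with Python to click the "Load more" button automatically, so that the page is fully loaded with the whole list of articles.

The question now is how to extract the links to all of those articles.

After fully loading the site with Selenium, I tried getting the HTML source with driver.page_source and printing it, but it still contains links to only the first 5 articles.

I want to get the links to all the articles that are loaded into the page after clicking the "Load more" button.

Could someone please help with a solution?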

Answer 1

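Perhaps the links take some time to show up, and your code is reading driver.page_source before the source has been updated. You can select the links with Selenium after an explicit wait, to make sure the links added to the page have fully loaded. Without a link to the source it is hard to be precise about what you need, but (in Python) it should be something like: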

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

def condition(driver):
    """If the selector defined in the function retrieves 10 or more results, return the results.
    Else, return None.
    """
    selector = 'a.my_class'  # Selects all <a> tags with the class "my_class"
    # find_elements(By.CSS_SELECTOR, ...) replaces find_elements_by_css_selector,
    # which was removed in Selenium 4
    els = driver.find_elements(By.CSS_SELECTOR, selector)
    if len(els) >= 10:
        return els
    return None

# `driver` is assumed to be your existing WebDriver, after clicking "Load more".
# Making an assignment only when the condition returns a truthy value when called (waiting up to 2 min):
links_elements = WebDriverWait(driver, timeout=120).until(condition)
# Getting the href attribute of the links
links_href = [link.get_attribute('href') for link in links_elements]

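In this code, you are:

Repeatedly looking for the elements you want until there are 10 or more. You can do this by CSS selector (as in the example), XPath, or another method. This gives you a list of Selenium objects as soon as the wait condition returns a truthy value; if the timeout expires first, the wait raises a TimeoutException. See more on explicit waits in the documentation. You should craft an appropriate condition for your case: expecting a fixed number of links may not be a good idea if you are not sure how many links there will be in the end.
Extracting what you want from the Selenium objects. To do so, use the appropriate method on the elements of the list you got from the previous step.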
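
If you do not know the final number of links, one possible condition is to wait until the count of matching links stops changing for a few seconds. The sketch below is only an illustration under assumptions: `link_count_is_stable`, `collect_links`, the `a.my_class` selector and the 3-second quiet period are all hypothetical and not part of Selenium, and the right selector depends on the site's markup.

```python
import time


class link_count_is_stable:
    """Hypothetical wait condition (not part of Selenium).

    Returns the matching elements once their count has stayed the same
    for `quiet_seconds`; returns False until then so the wait keeps polling.
    """

    def __init__(self, selector, quiet_seconds=3.0, clock=time.monotonic):
        self.selector = selector
        self.quiet_seconds = quiet_seconds
        self.clock = clock
        self.last_count = -1      # no count observed yet
        self.last_change = None   # time at which the count last changed

    def __call__(self, driver):
        # "css selector" is the string value of selenium's By.CSS_SELECTOR
        els = driver.find_elements("css selector", self.selector)
        now = self.clock()
        if len(els) != self.last_count:
            # The count changed (or this is the first poll): restart the timer.
            self.last_count = len(els)
            self.last_change = now
            return False
        if els and now - self.last_change >= self.quiet_seconds:
            return els
        return False


def collect_links(driver, selector="a.my_class", timeout=120):
    """Wait for the link list to settle, then return the href of each link."""
    # Imported here so the condition class above can be tried without Selenium.
    from selenium.webdriver.support.ui import WebDriverWait

    elements = WebDriverWait(driver, timeout).until(link_count_is_stable(selector))
    return [el.get_attribute("href") for el in elements]
```

With a real driver you would call `collect_links(driver)` after your "Load more" clicks. A quiet period that is too short can stop early on a slow site, so treat the value as something to tune.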
