Question

I have a list of URLs that I need to open with Selenium, run a script on, and extract certain links from.

Here is what I have so far:

import re
from selenium import webdriver

###  Variables  ###

regexp = re.compile(r'\.[\.a-z]?[\.a-z]?\/')


###  Function  ###

def get_links():

    driver = webdriver.Firefox(executable_path=r'/usr/local/bin/geckodriver')
    urls = ['https://www.url1.com.gt/', 'https://www.url2.com.co/', 'https://www.url3.com.pe']

    for url in urls:

        links = []
        target = []

        country = re.search(regexp, url).group()

        driver.get(url)
        driver.execute_script('return document.documentElement.outerHTML')

        hrefs = driver.find_elements_by_xpath('//a[@href]')

        for href in hrefs:

            links.append(href.get_attribute('href'))

        for link in links:

            if 'string to check' in link:
                target.append(link)

        return country, target


country, target = get_links()
df = {country: target}
print(df)

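The expected output is a dictionary whose keys are the countries and whose values are the matching links.

When I run this code it executes without errors, but it does not iterate over the list of URLs: it opens only the first URL and returns its data.

If I move the return statement outside the for url in urls loop, it returns the data for the third URL only.

How can I get the information for all the URLs in the list?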

Answer 1

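It looks like the return statement is inside the for loop. As a result, the function exits during the first iteration, so you only get the first URL. That said, have you tried @Andrex's suggestion of unindenting return so that it sits outside the main for loop? On its own that returns only the last URL, because country and target are overwritten on every iteration, so each pair also needs to be stored in a container defined before the loop, such as a dictionary. The final code should then look something like: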

def get_links():
    [SOME CODE]

    data = {}

    for url in urls:
        links = []
        target = []

        [SOME CODE]

        data[country] = target

    return data # Unindented
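
To make the effect of the indentation concrete, here is a minimal sketch that needs no browser. The function names first_only and all_items and the sample country codes are hypothetical, chosen only for illustration; the two functions differ solely in where return sits.

```python
def first_only(items):
    results = {}
    for item in items:
        results[item] = item.upper()
        return results  # inside the loop: the function exits during the first iteration


def all_items(items):
    results = {}
    for item in items:
        results[item] = item.upper()
    return results  # after the loop: runs only once every item has been processed


print(first_only(["gt", "co", "pe"]))  # {'gt': 'GT'}
print(all_items(["gt", "co", "pe"]))   # {'gt': 'GT', 'co': 'CO', 'pe': 'PE'}
```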

Hopefully this edited code helps you get the result you want.
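
For reference, a fuller sketch of the same idea follows. It is not the asker's exact code but one possible arrangement under a few assumptions: Selenium 4 is installed (so find_elements with By.XPATH replaces find_elements_by_xpath), a Firefox driver can be located without an explicit path, and the country is taken from the last label of the host name with urlparse instead of the original regular expression, which finds no match for a URL without a trailing slash such as 'https://www.url3.com.pe'. The helper names country_of, selenium_hrefs and main are hypothetical.

```python
from urllib.parse import urlparse


def country_of(url):
    """Return the last label of the host name, e.g. 'gt' for www.url1.com.gt."""
    return urlparse(url).hostname.rsplit(".", 1)[-1]


def selenium_hrefs(driver, url):
    """Load url in the given driver and return every href found on the page."""
    # Imported lazily so the rest of the module works without Selenium installed.
    from selenium.webdriver.common.by import By

    driver.get(url)
    anchors = driver.find_elements(By.XPATH, "//a[@href]")
    return [anchor.get_attribute("href") for anchor in anchors]


def get_links(urls, needle, fetch_hrefs):
    """Map each URL's country suffix to the links that contain needle."""
    data = {}
    for url in urls:
        hrefs = fetch_hrefs(url)
        # Skip missing hrefs and keep only the links containing the search string.
        data[country_of(url)] = [h for h in hrefs if h and needle in h]
    return data  # after the loop, so every URL is visited


def main():
    """Run against a real browser (assumes Selenium 4 and Firefox are available)."""
    from selenium import webdriver

    urls = ["https://www.url1.com.gt/", "https://www.url2.com.co/", "https://www.url3.com.pe"]
    driver = webdriver.Firefox()
    try:
        print(get_links(urls, "string to check", lambda u: selenium_hrefs(driver, u)))
    finally:
        driver.quit()  # close the browser even if a page fails to load
```

Passing the link-fetching step in as a function keeps the loop itself easy to check without a browser; calling main() runs it against the real pages.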

Iterate over a list of URLs and open each URL with Selenium
