Generating a list of URLs with Selenium in Python

Time: 2017-03-07 10:47:01

Tags: python selenium

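I'm trying to use Selenium to generate a list of URLs. I'd like the user to browse in the instrumented browser and, at the end, get a list of the URLs he visited.

I found that the "current_url" property helps with this, but I haven't found a way to know when the user has clicked on a link.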

In [117]: from selenium import webdriver

In [118]: browser = webdriver.Chrome()

In [119]: browser.get("http://stackoverflow.com")

--> here, I click on the "Questions" link.

In [120]: browser.current_url

Out[120]: 'http://stackoverflow.com/questions'

--> here, I click on the "Jobs" link.

In [121]: browser.current_url

Out[121]: 'http://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab'

Any hint appreciated!

Thanks,

1 answer:

Answer 0: (score: 2)

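There is currently no official way to monitor what a user does in Selenium. The only thing you can really do is start the driver and then run a loop that keeps checking driver.current_url. However, I don't know the best way to exit this loop, because I don't know your use case. Maybe try something like: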

import time

from selenium import webdriver


urls = []

driver = webdriver.Firefox()

current = 'http://www.google.com'
driver.get('http://www.google.com')
while True:
    if driver.current_url != current:
        current = driver.current_url

        # Option A: capture every URL, including duplicates:
        urls.append(current)

        # Option B: capture only unique URLs (use this INSTEAD of Option A):
        # if current not in urls:
        #     urls.append(current)

    # pause briefly so the loop does not query the driver non-stop
    time.sleep(0.5)

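If you have no idea how to end this loop, I'd suggest having the user navigate to a URL that breaks the loop, such as http://www.endseleniumcheck.com, and adding that check to the code: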

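import time

from selenium import webdriver


urls = []

driver = webdriver.Firefox()

current = 'http://www.google.com'
driver.get('http://www.google.com')
while True:
    # read the URL once per iteration so both checks see the same value
    now = driver.current_url

    # startswith() rather than ==, since the browser may report the URL
    # with a trailing slash
    if now.startswith('http://www.endseleniumcheck.com'):
        break

    if now != current:
        current = now

        # Option A: capture every URL, including duplicates:
        urls.append(current)

        # Option B: capture only unique URLs (use this INSTEAD of Option A):
        # if current not in urls:
        #     urls.append(current)

    # pause briefly so the loop does not query the driver non-stop
    time.sleep(0.5)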
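
The loop above can also be pulled into a small function, which keeps the "all URLs versus unique URLs" choice and the stop condition as parameters. This is only a sketch: collect_urls, get_current_url, should_stop, unique and poll_interval are names made up for illustration, not part of Selenium. The function takes plain callables, so it can be tried out without a browser.

```python
import time
from typing import Callable, List, Optional


def collect_urls(
    get_current_url: Callable[[], str],
    should_stop: Callable[[str], bool],
    unique: bool = False,
    poll_interval: float = 0.5,
) -> List[str]:
    """Poll get_current_url() until should_stop(url) returns True.

    A URL is recorded each time it differs from the previously seen one.
    With unique=True, a URL that is already in the list is not added again.
    """
    urls: List[str] = []
    previous: Optional[str] = None
    while True:
        current = get_current_url()
        if should_stop(current):
            break
        if current != previous:
            previous = current
            if not unique or current not in urls:
                urls.append(current)
        time.sleep(poll_interval)
    return urls


# Hypothetical wiring with a real driver (not executed here):
#   visited = collect_urls(
#       lambda: driver.current_url,
#       lambda url: url.startswith('http://www.endseleniumcheck.com'),
#   )
```

Unlike the script above, this version starts with no "previous" URL, so the first page the browser reports is always recorded.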

Or, if you want to get crafty, you can end the loop when the user quits the browser. You can do this by monitoring the process ID with the psutil library (pip install psutil):

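import time

from selenium import webdriver
import psutil


urls = []

driver = webdriver.Firefox()
# NOTE: driver.binary.process depends on the Selenium/Firefox driver version;
# other versions may not expose the browser process this way.
pid = driver.binary.process.pid

current = 'http://www.google.com'
driver.get('http://www.google.com')
while True:
    # stop once the browser process no longer exists
    if not psutil.pid_exists(pid):
        break

    if driver.current_url != current:
        current = driver.current_url

        # Option A: capture every URL, including duplicates:
        urls.append(current)

        # Option B: capture only unique URLs (use this INSTEAD of Option A):
        # if current not in urls:
        #     urls.append(current)

    # pause briefly so the loop does not query the driver non-stop
    time.sleep(0.5)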
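
A possible alternative to watching the process ID, sketched under assumptions: once the browser window is gone, reading driver.current_url generally fails with an exception, so the loop can simply stop on the first such failure. Which exception is raised depends on the driver and Selenium version (often something from selenium.common.exceptions, but that is an assumption here), so the sketch takes the exception types as a parameter. collect_until_error, stop_errors and poll_interval are made-up names for illustration.

```python
import time
from typing import Callable, List, Optional, Tuple, Type


def collect_until_error(
    get_current_url: Callable[[], str],
    stop_errors: Tuple[Type[BaseException], ...] = (Exception,),
    poll_interval: float = 0.5,
) -> List[str]:
    """Record URL changes until get_current_url() raises one of stop_errors."""
    urls: List[str] = []
    previous: Optional[str] = None
    while True:
        try:
            current = get_current_url()
        except stop_errors:
            # Treated as "the browser is gone"; stop polling.
            break
        if current != previous:
            previous = current
            urls.append(current)
        time.sleep(poll_interval)
    return urls


# Hypothetical wiring with a real driver (not executed here):
#   from selenium.common.exceptions import WebDriverException
#   visited = collect_until_error(
#       lambda: driver.current_url,
#       stop_errors=(WebDriverException,),
#   )
```

Catching a broad exception type keeps the sketch short, but it also hides unrelated errors, so narrowing stop_errors to what your driver actually raises is probably the safer choice.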