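Making Selenium get a list of URLs from a .txt file

Time: 2019-03-29 23:16:28

Tags: python linux selenium

I have code that returns the titles of a list of URLs. I would like to extend it in a few ways.

Here is the code: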

from pyvirtualdisplay import Display
from time import sleep
import sys
# Python 2 only: forces UTF-8 as the default encoding (remove on Python 3)
reload(sys)
sys.setdefaultencoding('utf-8')
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.firefox.options import Options
# run Firefox inside a virtual (headless) display
display = Display(visible=0, size=(800, 600))
display.start()
urls = ["https://google.com", "https://youtube.com"]
driver = webdriver.Firefox(executable_path='/usr/local/lib/geckodriver/geckodriver')
driver.set_page_load_timeout(60)
for url in urls:
    try:
        driver.get(url)
        print(driver.title)
    except TimeoutException as e:
        print("Timeout")
driver.quit()

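With this, I would like to do the following. First, instead of hard-coding the list of URLs like this, I want to read them from a .txt file. Second, when the script checks an individual URL, I want it to wait for the title to change from "Loading..." to something else and then print the changed title. For that I have tried: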

while driver.title == 'Loading...':  
     pass
print(driver.title)

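The problem is that sometimes the title never changes from "Loading...", so the program gets stuck there forever. If the title still has not changed after 10 seconds, I would like the script to print "Title didn't load" and move on to the next URL in the list.

There is one more thing I am not sure how to do. The title is printed with print(driver.title). I would like to add a number after the title, something like print(driver.title, number). The number is there to show how many URLs have been processed so far, but it should not start at 1. It should start from a larger number, such as 50, so that the fifth URL prints "URL title, 55". How can I do that?

Thanks.

2 Answers:

Answer 0 (score: 1)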

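To time out after 10 seconds if the title has not changed, I can offer something that works for me on the Java side (the snippet is written in Groovy). I know you are using Python, but this is what I have to show. You should be able to swap in the appropriate Python syntax.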

import groovy.time.TimeCategory
import groovy.time.TimeDuration

def timeExpired = false
def timeoutPeriod = new TimeDuration(0, 0, 10, 0) // hours, minutes, seconds, millis: 10 seconds
def timeStart = new Date()
def titleFound = false
def title

while (!titleFound && !timeExpired) { // loop while the title is not found AND time has not expired
    try {
        title = driver.title
        titleFound = title != "Loading..."
        if (!titleFound) { // only check the clock if the title has not been found yet
            timeExpired = TimeCategory.minus(new Date(), timeStart) > timeoutPeriod
            if (timeExpired) {
                title = "Title didn't load"
            }
        }
    }
    catch (Exception e) {
        // Handle the exception
    }
}

print(title)
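
For readers working in Python, the same polling idea might be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tested drop-in: `wait_for_title` and its parameters are hypothetical names, and the function takes a `get_title` callable (for example `lambda: driver.title`) so the logic does not depend on Selenium itself.

```python
import time


def wait_for_title(get_title, placeholder="Loading...", timeout=10.0, poll=0.5):
    """Poll get_title() until it differs from placeholder or timeout elapses.

    Returns the new title, or None if the title never changed in time.
    (Hypothetical helper; get_title would be e.g. `lambda: driver.title`.)
    """
    deadline = time.monotonic() + timeout
    while True:
        title = get_title()
        if title != placeholder:
            return title
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)  # avoid a busy loop between checks
```

Inside the URL loop this could be used as `title = wait_for_title(lambda: driver.title)`, printing "Title didn't load" when the result is `None`. `time.monotonic()` needs Python 3.3 or later; on Python 2, `time.time()` would be the closest substitute.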

To read the URLs from a text file, separate the URLs with commas and read in the contents:

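with open("filename.txt", "r") as text_file:
    # strip whitespace and newlines around each URL and skip empty entries
    lines = [u.strip() for u in text_file.read().split(',') if u.strip()]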

I don't have Python up and running to confirm this is correct, but you can loop over lines and pass each URL to the driver the same way you already do.
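
A slightly more forgiving reader, sketched under the assumption that the file may hold URLs separated by commas, line breaks, or both. `read_urls` is a hypothetical helper name and is not part of Selenium.

```python
import re


def read_urls(path):
    """Return the URLs in a text file, split on commas and/or newlines."""
    with open(path, "r") as handle:
        content = handle.read()
    # Split on commas or line breaks, then drop surrounding whitespace and blanks.
    return [part.strip() for part in re.split(r"[,\n]", content) if part.strip()]
```

The result can be looped over directly, for example `for url in read_urls("filename.txt"): driver.get(url)`.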

Finally, to add a counter to the printout, set a counter variable to whatever number you want to start from before you begin looping over the URLs.

counter = 50

Then inside the loop, increment it by 1 each time:

counter += 1

To add it to the printout, you could do something like:

print(title + " " + str(counter))

The syntax may not be perfect, but it should be close.
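
The counter steps can also be folded into the loop with `enumerate`, which is one possible way to write it. In this sketch `numbered_titles` and `offset` are hypothetical names, and the numbering assumes the question's intent that the fifth URL should show 55, so the first URL shows 51.

```python
def numbered_titles(titles, offset=50):
    """Pair each title with offset + its 1-based position in the list."""
    # enumerate(..., start=offset + 1) numbers the first title offset + 1,
    # so with offset=50 the fifth title is numbered 55.
    return ["{}, {}".format(title, number)
            for number, title in enumerate(titles, start=offset + 1)]


for line in numbered_titles(["Google", "YouTube"]):
    print(line)
```

In the real script the same `enumerate(urls, start=51)` call could wrap the URL loop directly, so the number also advances for URLs that time out.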

Answer 1 (score: 1)

Here is the updated script with your requirements.

from pyvirtualdisplay import Display
import time
import sys
# Python 2 only: forces UTF-8 as the default encoding (remove on Python 3)
reload(sys)
sys.setdefaultencoding('utf-8')
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.firefox.options import Options
from datetime import datetime

# check the driver title every `interval` seconds for at most `maxTime` seconds;
# return the last title seen (still 'Loading...' if it never changed)
def wait_until_browser_loaded(interval, maxTime):
    start_time = datetime.now()
    while (datetime.now() - start_time).total_seconds() < maxTime:
        time.sleep(interval)
        if driver.title != 'Loading...':
            break
    return driver.title

display = Display(visible=0, size=(800, 600))
display.start()
# open the external input file and read the URLs from it
with open("urls_file_path_goes_here", "r") as urlsFile:
    # use this if you want to enter urls on different lines
    urls = [line.strip() for line in urlsFile if line.strip()]
    # use this instead if you want to enter comma separated urls
    #urls = [u.strip() for u in urlsFile.read().split(",") if u.strip()]

driver = webdriver.Firefox(executable_path='/usr/local/lib/geckodriver/geckodriver')
driver.set_page_load_timeout(60)
titleAppendNumber = 50
for url in urls:
    # count every URL, including ones that time out, so the fifth URL is numbered 55
    titleAppendNumber += 1
    try:
        driver.get(url)

        title = driver.title
        if title == "Loading...":
            # re-read the title after waiting, otherwise the check below never changes
            title = wait_until_browser_loaded(5, 10)
        if title == "Loading...":
            print("Title didn't load" + " - " + str(titleAppendNumber))
        else:
            print(title + " - " + str(titleAppendNumber))
    except TimeoutException as e:
        print("Timeout")
driver.quit()