How can I open a new webpage with selenium in Python (and close the old one)?

Asked: 2017-07-26 07:01:43

Tags: python selenium web-scraping geckodriver

I want to do some web scraping in Python 2.7 on sites that need selenium, wait a little, then close the browser together with geckodriver.exe afterwards (because I don't want to end up with millions of open browser pages and .exe processes).

Is there any way I can do this?

My commented code:



from bs4 import BeautifulSoup
from selenium import webdriver
import time
import urllib2
import unicodecsv as csv
import os
import sys
import io
import datetime
import pandas as pd
import MySQLdb
import re
import contextlib
import selenium.webdriver.support.ui as ui

#I am creating a new csv file
filename=r'output.csv'

resultcsv=open(filename,"wb")
output=csv.writer(resultcsv, delimiter=';',quotechar = '"', quoting=csv.QUOTE_NONNUMERIC, encoding='latin-1')

#I am opening the website with selenium (a JS-rendered site)
profile=webdriver.FirefoxProfile()
profile.set_preference("intl.accept_languages","en-us")
driver = webdriver.Firefox(firefox_profile=profile)
driver.get("https://www.flightradar24.com/data/airports/bud/arrivals")
time.sleep(10)
html_source=driver.page_source
soup=BeautifulSoup(html_source,"html.parser")
print soup

#HERE I AM SCRAPING THE INFORMATION I NEED AND
#THEN WRITING IT INTO THIS CSV FILE.
 
output.writerows(datatable)
 
resultcsv.close()

#AND MY QUESTION STARTS HERE. I WANT TO CLOSE THIS SESSION,
#WAIT A LITTLE (FOR EXAMPLE 10 SEC, BECAUSE SCRAPING THE DATA
#NEEDS SOME TIME), THEN CLOSE GECKODRIVER + FIREFOX, AND
#REPEAT THIS CODE WITH A NEW WEBSITE. IS IT POSSIBLE?
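
In code form, the cycle those comments describe would look roughly like the sketch below. driver.quit() closes the Firefox window and also ends the geckodriver.exe process it spawned, so nothing is left running between iterations; the scraping step is elided and the url list is illustrative:

from selenium import webdriver
import time

urls = ["https://www.flightradar24.com/data/airports/bud/arrivals",
        "https://www.flightradar24.com/data/airports/fco/arrivals"]

for url in urls:
    driver = webdriver.Firefox()   # a fresh Firefox + geckodriver.exe per site
    driver.get(url)
    time.sleep(10)                 # crude wait for the JS-rendered page
    html_source = driver.page_source
    # ... scrape html_source here ...
    driver.quit()                  # closes the window and ends geckodriver.exe
    time.sleep(10)                 # pause before opening the next session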




Updated code - nutmeg64

I get this error message:

  File "C:/Python27/air17.py", line 43, in <module>
    scrape(urls)
  File "C:/Python27/air17.py", line 28, in scrape
    table = soup.find('table', {"class": "table table-condensed table-hover data-table m-n-t-15"})
NameError: global name 'soup' is not defined



from bs4 import BeautifulSoup
from selenium import webdriver
import time
import urllib2
import unicodecsv as csv
import os
import sys
import io
import datetime
import pandas as pd
import MySQLdb
import re
import contextlib
import selenium.webdriver.support.ui as ui

filename=r'output.csv'

resultcsv=open(filename,"wb")
output=csv.writer(resultcsv, delimiter=';',quotechar = '"', quoting=csv.QUOTE_NONNUMERIC, encoding='latin-1')

def scrape(urls):
    browser = webdriver.Firefox()
    for url in urls:
        browser.get(url)
        html = browser.page_source
        soup=BeautifulSoup(html,"html.parser")
        table = soup.find('table', { "class" : "table table-condensed table-hover data-table m-n-t-15" })
        datatable=[]
        for record in table.find_all('tr', class_="hidden-xs hidden-sm ng-scope"):
            temp_data = []
            for data in record.find_all("td"):
                temp_data.append(data.text.encode('latin-1'))
            datatable.append(temp_data)
 
        output.writerows(datatable)
 
        resultcsv.close()
        time.sleep(10) 
        browser.quit()

urls = ["https://www.flightradar24.com/data/airports/bud/arrivals", "https://www.flightradar24.com/data/airports/fco/arrivals"]
scrape(urls)
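
A note on this error: the traceback points at a soup reference at module scope in the file that was actually run, so air17.py may differ slightly from the code pasted here; in the pasted version, soup is assigned on the line just before it is used. Independent of that, two visible bugs are that resultcsv.close() and browser.quit() run inside the for loop, so the csv file and the browser are already gone when the second url comes around. A sketch of the same function with both calls moved after the loop (it still uses the output writer and resultcsv file from the setup above):

def scrape(urls):
    browser = webdriver.Firefox()
    for url in urls:
        browser.get(url)
        soup = BeautifulSoup(browser.page_source, "html.parser")
        table = soup.find('table', { "class" : "table table-condensed table-hover data-table m-n-t-15" })
        datatable = []
        for record in table.find_all('tr', class_="hidden-xs hidden-sm ng-scope"):
            datatable.append([data.text.encode('latin-1') for data in record.find_all("td")])
        output.writerows(datatable)   # the file stays open between urls
        time.sleep(10)
    browser.quit()      # quit the browser once, after the last url
    resultcsv.close()   # and only then close the csv file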
 




1 Answer:

Answer 0 (score: 0)

Put the selenium part into a function and call it with different urls. Sleep 10 seconds between iterations.

BTW, this is not the ideal solution. You only have to open selenium once, read the source, and then browser.get(new_url). After all the scraping is done, call browser.quit() to release everything.

For example (very, very simplified):

def scrape(urls):
    browser = webdriver.Firefox()
    for url in urls:
        browser.get(url)
        html = browser.page_source
        # scrape the html as you like
        # create a csv file for that specific url
        # write results to csv and close it
        time.sleep(10) # <-- not really necessary. scraping and writing to csv is a long enough break
    browser.quit()

urls = ["http://example.com", "http://notarealwebsite.co.uk", "http://lastwebpagetoscrape.com" ]
scrape(urls)
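
Filling in the skeleton's comments, one possible concrete version that writes a separate csv per url; the table selector and latin-1 encoding are taken from the question, while the per-url file-naming scheme is an assumption:

import time
import unicodecsv as csv
from bs4 import BeautifulSoup
from selenium import webdriver

def scrape(urls):
    browser = webdriver.Firefox()
    for url in urls:
        browser.get(url)
        time.sleep(10)  # let the JS-rendered table load
        soup = BeautifulSoup(browser.page_source, "html.parser")
        table = soup.find('table', { "class" : "table table-condensed table-hover data-table m-n-t-15" })
        rows = [[td.text.encode('latin-1') for td in tr.find_all('td')]
                for tr in table.find_all('tr', class_="hidden-xs hidden-sm ng-scope")]
        # one csv per url, e.g. bud_arrivals.csv (naming scheme is an assumption)
        name = "_".join(url.rstrip("/").split("/")[-2:]) + ".csv"
        with open(name, "wb") as f:
            writer = csv.writer(f, delimiter=';', quotechar='"',
                                quoting=csv.QUOTE_NONNUMERIC, encoding='latin-1')
            writer.writerows(rows)
    browser.quit()

scrape(["https://www.flightradar24.com/data/airports/bud/arrivals",
        "https://www.flightradar24.com/data/airports/fco/arrivals"])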