Web scraping and saving to a CSV file

Time: 2016-12-15 20:40:55

Tags: python python-2.7 csv selenium-webdriver web-scraping

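I am using Selenium, and the automation part works well, but the data is saved to the CSV incorrectly. Even though my input file f (a CSV) contains four addresses, the output only contains the data for the first address, repeated over and over. Also, how can I tell Python to write a single header row for all the columns, instead of repeating Permit, Address, Street Name, etc. on every iteration? Please let me know if you need any further details.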

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv


driver = webdriver.Chrome(r"C:\Python27\Scripts\chromedriver.exe")
chrome = driver.get('https://etrakit.friscotexas.gov/Search/permit.aspx')


wait = WebDriverWait(driver, 10)

with open('C:/Users/list.csv','r') as f:
    addresses = f.readlines()

for address in addresses:
    driver.find_element_by_css_selector('#cplMain_txtSearchString').clear()       
    driver.find_element_by_css_selector('#cplMain_txtSearchString').send_keys(address)
    driver.find_element_by_css_selector('#cplMain_btnSearch').click()

    table = wait.until(EC.visibility_of_element_located((By.ID, "ctl00_cplMain_rgSearchRslts_ctl00")))

    df = pd.read_html(table.get_attribute("outerHTML"))[0]

    with open('thematchingresults.csv', 'a') as f:
        df.to_csv(f)

The four addresses I want to parse:

6525 Mountain Sky Rd
6543 Mountain Sky Rd
6561 Mountain Sky Rd
6579 Mountain Sky Rd

How the data ends up in the CSV file:

Permit Number   Address                  Street Name     Applicant Name     Contractor Name SITE_SUBDIVISION    RECORDID
0   B13-2169    6525 MOUNTAIN SKY RD    MOUNTAIN SKY RD SHADDOCK HOMES LTD  SHADDOCK HOMES LTD  PCR - SHERIDAN  MAC:1306181017281473
1   L13-3451    6525 MOUNTAIN SKY RD    MOUNTAIN SKY RD TDS IRRIGATION  TDS IRRIGATION  SHERIDAN    ECON:131115094522681
2   ROW13-6260  6525 Mountain Sky Rd    Mountain Sky Rd AT&T Broadband & Internet Serv  Housley Group       SSW:1312030140165722
Permit Number   Address                   Street Name       Applicant Name   Contractor Name    SITE_SUBDIVISION    RECORDID
0   B13-2169    6525 MOUNTAIN SKY RD    MOUNTAIN SKY RD SHADDOCK HOMES LTD  SHADDOCK HOMES LTD  PCR - SHERIDAN  MAC:1306181017281473
1   L13-3451    6525 MOUNTAIN SKY RD    MOUNTAIN SKY RD TDS IRRIGATION  TDS IRRIGATION  SHERIDAN    ECON:131115094522681
2   ROW13-6260  6525 Mountain Sky Rd    Mountain Sky Rd AT&T Broadband & Internet Serv  Housley Group       SSW:1312030140165722

1 answer:

Answer 0 (score: 1)

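Your code is almost fine, but wait.until() appears to be satisfied before the page has had a chance to update: the results table from the previous search is still visible, so the old data is read again. Adding a short delay (two seconds here) just before wait.until() is enough to make it work, although you should look into a more robust approach: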

time.sleep(2)  # needs "import time" at the top of the script

This gave me the following CSV output file:

,Permit Number,Address,Street Name,Applicant Name,Contractor Name,SITE_SUBDIVISION,RECORDID
0,B13-2169,6525 MOUNTAIN SKY RD,MOUNTAIN SKY RD,SHADDOCK HOMES LTD,SHADDOCK HOMES LTD,PCR - SHERIDAN,MAC:1306181017281473
1,L13-3451,6525 MOUNTAIN SKY RD,MOUNTAIN SKY RD,TDS IRRIGATION,TDS IRRIGATION,SHERIDAN,ECON:131115094522681
2,ROW13-6260,6525 Mountain Sky Rd,Mountain Sky Rd,AT&T Broadband & Internet Serv,Housley Group,,SSW:1312030140165722
,Permit Number,Address,Street Name,Applicant Name,Contractor Name,SITE_SUBDIVISION,RECORDID
0,B14-0771,6543 MOUNTAIN SKY RD,MOUNTAIN SKY RD,DREES CUSTOM HOMES,DREES CUSTOM HOMES,PCR - SHERIDAN,LWE:1403121043033654
1,L14-2401,6543 MOUNTAIN SKY RD,MOUNTAIN SKY RD,DFW SITE DESIGN,DFW SITE DESIGN,SHERIDAN,ECON:140711080345627
2,ROW15-4097,6543 MOUNTAIN SKY RD,MOUNTAIN SKY RD,HOUSLEY GROUP,HOUSLEY GROUP,,TLW:1507220204411002
,Permit Number,Address,Street Name,Applicant Name,Contractor Name,SITE_SUBDIVISION,RECORDID
0,B13-2364,6561 MOUNTAIN SKY RD,MOUNTAIN SKY RD,DREES CUSTOM HOMES,DREES CUSTOM HOMES,PCR - SHERIDAN,MAC:1307030929232194
1,L14-1500,6561 MOUNTAIN SKY RD,MOUNTAIN SKY RD,DFW SITE DESIGN,DFW SITE DESIGN,SHERIDAN,ECON:140424040055127
2,P15-0073,6561 MOUNTAIN SKY RD,MOUNTAIN SKY RD,RIVERBEND/SANDLER POOLS,,SHERIDAN,HC:1502160438345148
,Permit Number,Address,Street Name,Applicant Name,Contractor Name,SITE_SUBDIVISION,RECORDID
0,B13-2809,6579 MOUNTAIN SKY RD,MOUNTAIN SKY RD,SHADDOCK HOMES LTD,SHADDOCK HOMES LTD,PCR - SHERIDAN,MAC:1308050328358768
1,B13-4096,6579 MOUNTAIN SKY RD,MOUNTAIN SKY RD,MIRAGE CUSTOM POOLS,MIRAGE CUSTOM POOLS,PCR - SHERIDAN,MAC:1312030307087756
2,L14-1640,6579 MOUNTAIN SKY RD,MOUNTAIN SKY RD,TDS IRRIGATION,TDS IRRIGATION,SHERIDAN,ECON:140506012624706
3,P14-0018,6579 MOUNTAIN SKY RD,MOUNTAIN SKY RD,MIRAGE CUSTOM POOLS,,SHERIDAN,LCR:1401130949212891
4,ROW14-3205,6579 MOUNTAIN SKY RD,MOUNTAIN SKY RD,Housley Group,Housley Group,,TLW:1406190424422330
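The output above still repeats the header row once per address, which is the second thing the question asks about. One way to avoid that, sketched here with only the standard library and written for Python 3, is to write the header only when the output file is missing or empty. The helper names `needs_header` and `append_rows` are made up for this illustration and are not part of the original script.

```python
import csv
import os


def needs_header(path):
    """Return True when the CSV at `path` does not exist yet or is empty."""
    return not os.path.exists(path) or os.path.getsize(path) == 0


def append_rows(path, header, rows):
    """Append `rows` to the CSV at `path`, writing `header` only once."""
    # Decide before opening: opening in append mode creates the file.
    write_header = needs_header(path)
    # newline='' stops the csv module from adding blank lines on Windows.
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(header)
        writer.writerows(rows)
```

With pandas, the same check could be passed to the existing call, for example df.to_csv(path, mode='a', header=needs_header(path), index=False). Passing index=False would also drop the unnamed leading index column seen in the output above.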

As an alternative to the fixed delay, you can keep polling the table until you see that new data has loaded:

import pandas as pd
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv


driver = webdriver.Chrome(r"C:\Python27\chromedriver.exe")
chrome = driver.get('https://etrakit.friscotexas.gov/Search/permit.aspx')
wait = WebDriverWait(driver, 10)

with open('C:/Users/list.csv','r') as f:
    addresses = f.readlines()

old_table_html = None  # outerHTML of the previous results table (none yet)

for address in addresses:
    print(address)  # the parenthesised form also works under Python 2.7
    driver.find_element_by_css_selector('#cplMain_txtSearchString').clear()       
    driver.find_element_by_css_selector('#cplMain_txtSearchString').send_keys(address)
    driver.find_element_by_css_selector('#cplMain_btnSearch').click()

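    # Keep re-reading the table until its HTML differs from the previous search.
    while True:
        try:
            table = wait.until(EC.visibility_of_element_located((By.ID, "ctl00_cplMain_rgSearchRslts_ctl00")))
            table_html = table.get_attribute("outerHTML")

            if table_html != old_table_html:
                break
        except selenium.common.exceptions.StaleElementReferenceException:
            # The table was replaced while it was being read; try again.
            pass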

    old_table_html = table_html
    df = pd.read_html(table_html)[0]

    with open('thematchingresults.csv', 'a') as f:
        df.to_csv(f)
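
One caveat with the polling loop above: if two consecutive searches happen to return identical tables, the HTML never changes and the loop never exits. A minimal sketch of a bounded version of the same idea follows, written for Python 3 with only the standard library. The function name `wait_for_change` and its parameters are hypothetical, chosen for this illustration, and `fetch` stands in for whatever reads the table's HTML.

```python
import time


def wait_for_change(fetch, previous, timeout=10.0, interval=0.25):
    """Call fetch() repeatedly until it returns a value other than `previous`.

    Raises RuntimeError if no different value appears within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = fetch()
        if current != previous:
            return current
        if time.monotonic() >= deadline:
            raise RuntimeError("value did not change within %.1f s" % timeout)
        # Pause briefly so the page is not polled in a tight loop.
        time.sleep(interval)
```

In the scraper, `fetch` could be a small function that locates the results table and returns its outerHTML, and `previous` would be old_table_html. Catching the RuntimeError would let the script log the address and move on instead of hanging.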