Web scraping with Selenium: clicking a button and grabbing all the content

Time: 2020-09-25 21:12:52

Tags: python selenium-webdriver web-scraping

I have been working on this scraper for a while. I think it can be improved, but I'm not sure where to go from here.

The original scraper looks like this, and I believe it does everything I need:

                                                                                                                                                        
import time

from selenium import webdriver

url = "https://matrix.heartlandmls.com/Matrix/Public/Portal.aspx?L=1&k=990316X949Z&p=DE-74613894-421"

h_table = []
driver = webdriver.Firefox()
driver.get(url)
# Click the link on the portal page that opens the listing view.
driver.find_element_by_xpath("/html/body/form/div[3]/div/div/div[5]/div[3]/span[2]/div/div/div[2]/div[1]/div/div/div[2]/div[2]/div[1]/span/a").click()
time.sleep(10)
# Repeat 200 times: save the text of the current listing, then click
# the link that advances to the next one and wait a fixed 10 seconds.
for _ in range(200):
    h_table.append(driver.find_element_by_id("wrapperTable").text)
    driver.find_element_by_xpath("/html/body/form/div[3]/div/div/div[5]/div[2]/div/div[1]/div/div/span/ul/li[2]/a").click()
    time.sleep(10)

This collects everything into a list (h_table) that I can clean up afterwards:

['210 Sitter Street\nPleasant Hill, MO 64080\nMLS#:2178982\nMap\n$15,000\nSold\n4Bedrms\n2Full Bath(s)\n0Half Bath(s)\n1,848Sqft\nBuilt in1950\n0.27Acres\nSingle Family\n1 / 10\nThis Home sits on a level, treed, and nice .279 acre sizeable double lot. The property per taxes, is identified as a Single Family Home however it has 2 separate utility meters and 2 living spaces, each with 2 bedrooms and 1 full bath and laundry areas, and was utilized as a Duplex for Rental income for 2 units. This property is a CASH ONLY sale and is being sold "In It\'s Present Condition". Home and detached garage are in need of repair OR would be a candidate for a tear down and complete rebuild on the lot.\nAbout 210 Sitter Street, Pleasant Hill, MO 64080\nDirections:I-70 to 7 Hwy, to Broadway, to Sitter St, to property.\nGeneral Description\nMLS Number\n2178982\nCounty\nCass\nCity\nPleasant Hill\nSub Div\nWalkers & Sitlers\nType\nSingle Family\nFloor Plan Description\nRanch\nBdrms\n4\nBaths Full\n2\nBaths Half\n0\nAge Description\n51-75 Years\nYear Built\n1950\nSqft Main\n1848\nSQFT MAIN SOURCE\nPublic Record\nBelow Grade Finished Sq Ft\n0\nBelow Grade Finished Sq Ft Source\nPublic Record\nSqft\n1848\nLot Size\n12,155\nAcres\n0.27\nSchools E\nPleasant Hill Prim\nSchools M\nPleasant Hill\nSchools H\nPleasant Hill\nSchool District\nPleasant Hill\nLegal Description\nWALKER & SITLERS LOT 47 & 48 BLK 5\nS Terms\nCash\nInterior Features\nFireplace?\nY\nFireplace Description\nLiving Room, Wood Burning\nBasement\nN\nBasement Description\nBlock, Crawl Space\nDining Area Description\nEat-In Kitchen\nUtility Room\nMultiple, Main Level\nInterior Features\nFixer Up\nRooms\nBathroom Full\nLevel 1\n2nd Full Bath\nLevel 1\nMaster Bedroom\nLevel 1\nSecond Bedroom\nLevel 1\nMaster BR- 2nd\nLevel 1\nFourth Bedroom\nLevel 1\nKitchen\nLevel 1\nKitchen- 2nd\nLevel 1\nLiving Room\nLevel 1\nFamily Rm- 2nd\nLevel 1\nExterior / Construction\nGarage/Parking?\nY\nGarage/Parking #\n2\nGarage Description\nDetached, 
Front Entry\nConstruction\nFrame\nArchitecture\nTraditional\nRoof\nComposition\nLot Description\nCity Limits, City Lot, Level, Treed\nIn Floodplain\nNo\nInside City Limits\nYes\nStreet Maintenance\nPub Maint, Paved\nExterior Features\nFixer Up\nUtility Information\nCentral Air\nY\nHeat\nForced Air Gas\nCool\nCentral Electric, Window Unit(s)\nWater\nCity/Public\nSewer\nCity/Public\nFinancial Information\nS Terms\nCash\nHoa Amount\n$0\nTax\n$1,066\nSpecial Tax\n$0\nTotal Tax\n$1,066\nExclusions\nEntire Property\nType Of Ownership\nPrivate\nWill Sell\nCash\nAssessment & Tax\nAssessment Year\n2019\n2018\n2017\nAssessed Value - Total\n$17,240\n$15,380\n$15,380\nAssessed Value - Land\n$2,400\n$1,920\n$1,920\nAssessed Value - Improved\n$14,840\n$13,460\n$13,460\nYOY Change ($)\n$1,860\n$\nYOY Change (%)\n12%\n0%\nTax Year\n2019\n2018\n2017\nTotal Tax\n$1,178.32\n$1,065.64\n$1,064.30\nYOY Change ($)\n$113\n$1\nYOY Change (%)\n11%\n0%\nNotes for you and your agent\nAdd Note\nMap data ©2020\nTerms of Use\nReport a map error\nMap\n200 ft \nParcel Disclaimer'
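Since each entry is one newline-separated string, the cleanup step could start by pulling a few fields out with regular expressions. The following is only a sketch: `parse_listing` and its field names are hypothetical, and the patterns assume every listing follows the same layout as the sample above (street on the first line, city on the second, then lines such as `MLS#:...`, `$...`, `...Bedrms`, `...Sqft`).

```python
import re


def parse_listing(text):
    """Extract a few header fields from one listing's raw text (hypothetical helper)."""
    lines = text.splitlines()
    record = {
        "street": lines[0] if lines else "",
        "city_state_zip": lines[1] if len(lines) > 1 else "",
    }
    # Each pattern captures a number; re.search returns the first match,
    # which for "price" is the list price near the top of the listing.
    patterns = {
        "mls": r"MLS#:(\d+)",
        "price": r"^\$([\d,]+)$",
        "bedrooms": r"^(\d+)Bedrms$",
        "sqft": r"^([\d,]+)Sqft$",
        "year_built": r"^Built in(\d{4})$",
    }
    for key, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match:
            record[key] = int(match.group(1).replace(",", ""))
    return record


# The first lines of the sample listing shown above.
sample = (
    "210 Sitter Street\nPleasant Hill, MO 64080\nMLS#:2178982\nMap\n$15,000\n"
    "Sold\n4Bedrms\n2Full Bath(s)\n0Half Bath(s)\n1,848Sqft\nBuilt in1950\n"
    "0.27Acres\nSingle Family"
)
record = parse_listing(sample)
print(record)
```

Fields that are missing from a listing are simply left out of the dictionary, so a later step can decide how to handle gaps.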

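However, I have seen other examples that use WebDriverWait, which I think would speed up the scraping considerably, but so far I have not gotten it to work. This is the code I wrote: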

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://matrix.heartlandmls.com/Matrix/Public/Portal.aspx?L=1&k=990316X949Z&p=DE-74613894-421"
h_table = []
xpath = '/html/body/form/div[3]/div/div/div[5]/div[2]/div/div[1]/div/div/span/ul/li[2]/a'
driver = webdriver.Firefox()
driver.get(url)
driver.find_element_by_xpath("/html/body/form/div[3]/div/div/div[5]/div[3]/span[2]/div/div/div[2]/div[1]/div/div/div[2]/div[2]/div[1]/span/a").click()
time.sleep(10)
while True:
    button = driver.find_elements_by_xpath("/html/body/form/div[3]/div/div/div[5]/div[2]/div/div[1]/div/div/span/ul/li[2]/a")
    if len(button) < 1:
        print('done')
        break
    else:
        h_table.append(driver.find_element_by_id("wrapperTable").text)
        # Pass the xpath variable, not the string literal 'xpath'.
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath))).click()

This seems to return all the results, but it also produces duplicates, and I cannot stop it without a keyboard interrupt.

Calling len(h_table) returns 258, when it should be 200.
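A likely cause of both symptoms: `element_to_be_clickable` returns as soon as the link is clickable, which can be before the next listing has loaded, so the loop reads the same `wrapperTable` text twice; and if the link is still present on the last listing, the `len(button) < 1` check never fires. One possible approach, sketched below, is to wait until the text actually changes after each click and to stop when it no longer does. The helper names (`wait_for_change`, `scrape_all`, `scrape_with_selenium`) are hypothetical, and the sketch assumes no two consecutive listings have identical text.

```python
import time


def wait_for_change(read_text, previous, timeout=10.0, poll=0.5):
    """Poll read_text() until it differs from `previous`.

    Returns the new text, or None if nothing changed within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = read_text()
        if current != previous:
            return current
        time.sleep(poll)
    return None


def scrape_all(read_text, click_next, max_pages=200, timeout=10.0, poll=0.5):
    """Collect page texts, clicking "next" until the content stops changing."""
    pages = [read_text()]
    while len(pages) < max_pages:
        click_next()
        new_text = wait_for_change(read_text, pages[-1], timeout, poll)
        if new_text is None:  # no new content: assume the last listing was reached
            break
        pages.append(new_text)
    return pages


def scrape_with_selenium(driver, next_xpath, max_pages=200):
    """Wire the generic loop to a Selenium driver (untested against the real site)."""
    from selenium.webdriver.common.by import By

    def read_text():
        return driver.find_element(By.ID, "wrapperTable").text

    def click_next():
        buttons = driver.find_elements(By.XPATH, next_xpath)
        if buttons:  # if the link is gone, the wait times out and the loop ends
            buttons[0].click()

    return scrape_all(read_text, click_next, max_pages)
```

After the initial click that opens the first listing, something like `h_table = scrape_with_selenium(driver, xpath)` would replace the `while True` loop. Each iteration then takes only as long as the page needs to update, instead of a fixed `time.sleep(10)`.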

1 answer:

Answer 0 (score: 1)

If the length of the list is the problem, why not add this check inside the loop:

    if len(h_table) >= 200:
        print("done")
        break