Crawling different HTML structures with Selenium and Beautiful Soup

Time: 2015-10-06 15:08:43

Tags: html selenium xpath beautifulsoup web-crawler

I keep running into a wall.

These are the individual XPaths I am harvesting:

/html/body/div[8]/div/div[1]/div/div[3]/div[2]/div[2]/h2/a
/html/body/div[8]/div/div[1]/div/div[3]/div[17]/div[2]/div[2]/h2/a

I want to parse the items that correspond to the XPaths above from the web page.

Here is my code:

for j in range(2, innerElements):
    headline = driver.find_element_by_xpath("/html/body/div[8]/div/div[1]/div/div[3]/div["+str(j)+"]/div[2]/h2/a").text
    if headline:
        print(headline)
    elif headline:
        headline = driver.find_element_by_xpath("/html/body/div[8]/div/div[1]/div/div[3]/div[17]/div["+str(j)+"]/div[2]/h2/a").text
        print(headline)

Result:

New York Dinner Cruise
Big Apple Helicopter Tour of New York
Empire State Building Tickets - Observatory and Optional Skip the Line Tickets
Washington DC Day Trip from New York
New York City Explorer Pass
Circle Line: Complete Manhattan Island Cruise
2-Day Niagara Falls Tour from New York by Bus
Viator VIP: Empire State Building, Statue of Liberty and 9/11 Memorial
Big Bus New York Hop-on Hop-off Tour
New York CityPass
9/11 Memorial and Ground Zero Walking Tour with Optional 9/11 Museum Upgrade
New York in One Day Guided Sightseeing Tour
Viator Exclusive: Niagara Falls Day Trip from New York by Private Plane
Viator Exclusive: Statue of Liberty Monument Access and 9/11 Memorial
New York City Guided Sightseeing Tour by Luxury Coach

E
======================================================================
ERROR: test_sel (__main__.Crawling)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:/Users/PycharmProjects/unti/US.py", line 53, in test_sel
    headline = driver.find_element_by_xpath("/html/body/div[8]/div/div[1]/div/div[3]/div["+str(j)+"]/div[2]/h2/a").text
  File "C:\Python34\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 250, in find_element_by_xpath
    return self.find_element(by=By.XPATH, value=xpath)
  File "C:\Python34\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 692, in find_element
    {'using': by, 'value': value})['value']
  File "C:\Python34\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 193, in execute
    self.error_handler.check_response(response)
  File "C:\Python34\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 181, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: {"method":"xpath","selector":"/html/body/div[8]/div/div[1]/div/div[3]/div[17]/div[2]/h2/a"}
Stacktrace:
    at FirefoxDriver.prototype.findElementInternal_ (file:///C:/Users/hmattu/AppData/Local/Temp/tmp7kbz_wz2/extensions/fxdriver@googlecode.com/components/driver-component.js:10667)
    at FirefoxDriver.prototype.findElement (file:///C:/Users/hmattu/AppData/Local/Temp/tmp7kbz_wz2/extensions/fxdriver@googlecode.com/components/driver-component.js:10676)
    at DelayedCommand.prototype.executeInternal_/h (file:///C:/Users/hmattu/AppData/Local/Temp/tmp7kbz_wz2/extensions/fxdriver@googlecode.com/components/command-processor.js:12643)
    at DelayedCommand.prototype.executeInternal_ (file:///C:/Users/hmattu/AppData/Local/Temp/tmp7kbz_wz2/extensions/fxdriver@googlecode.com/components/command-processor.js:12648)
    at DelayedCommand.prototype.execute/< (file:///C:/Users/hmattu/AppData/Local/Temp/tmp7kbz_wz2/extensions/fxdriver@googlecode.com/components/command-processor.js:12590)

----------------------------------------------------------------------
Ran 1 test in 27.367s

FAILED (errors=1)

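I get the expected results from the first XPath, but I don't understand why the code doesn't switch to the second XPath once it can no longer find the first one.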

Unable to locate element: {"method":"xpath","selector":"/html/body/div[8]/div/div[1]/div/div[3]/div[17]/div[2]/h2/a"}

Can anyone give me feedback on this? Any feedback is appreciated.

Edit

Link to the web page: http://www.viator.com/New-York-City/d687-allthingstodo

1 Answer:

Answer 0 (score: 1)

I suggest you use a CSS selector.

The CSS selector you are looking for is

.bd h2.product-title a

I'm not sure about Python, since I know this in Java. But I would guess something like:

headlines = driver.find_elements_by_css_selector(".bd h2.product-title a")

for headline in headlines:
    print(headline.text)
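
As for why the loop in the question never switches to the second XPath: the first `find_element_by_xpath` call raises `NoSuchElementException` before the `if` is ever evaluated, and `elif headline:` repeats the same condition as `if headline:`, so that branch could not run in any case. If you still want an XPath fallback, one option is to catch the exception and try the next candidate. The following is only a minimal sketch of that control flow, not tested against the real page. `FakeDriver`, `find_text`, `first_match`, and the toy `page` mapping are all hypothetical stand-ins so the example runs without a browser; with real Selenium you would catch `selenium.common.exceptions.NoSuchElementException` around the `find_element_by_xpath(...).text` call instead.

```python
class NoSuchElementException(Exception):
    """Stand-in for selenium.common.exceptions.NoSuchElementException."""


class FakeDriver:
    """Hypothetical stub: raises when an XPath is absent, like a real lookup would."""

    def __init__(self, page):
        self.page = page  # toy mapping: XPath string -> headline text

    def find_text(self, xpath):
        if xpath not in self.page:
            raise NoSuchElementException(xpath)
        return self.page[xpath]


def first_match(driver, xpaths):
    """Return the text for the first XPath that resolves, or None if none do."""
    for xpath in xpaths:
        try:
            return driver.find_text(xpath)
        except NoSuchElementException:
            continue  # this candidate is missing, try the next one
    return None


BASE = "/html/body/div[8]/div/div[1]/div/div[3]"

# Invented layout for illustration only: one item sits directly under BASE,
# the other is nested one level deeper inside div[17].
page = {
    BASE + "/div[2]/div[2]/h2/a": "New York Dinner Cruise",
    BASE + "/div[17]/div[3]/div[2]/h2/a": "New York CityPass",
}
driver = FakeDriver(page)

headlines = []
for j in range(2, 4):
    candidates = [
        BASE + "/div[" + str(j) + "]/div[2]/h2/a",
        BASE + "/div[17]/div[" + str(j) + "]/div[2]/h2/a",
    ]
    text = first_match(driver, candidates)
    if text is not None:
        headlines.append(text)

print(headlines)
```

Even with the fallback, absolute XPaths like these break as soon as the page layout shifts, which is why the class-based CSS selector above is probably the more robust choice.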