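Web scraping with Selenium and XPath

Date: 2020-04-24 07:34:27

Tags: python selenium xpath

I am new to XPath. I am trying to scrape a stock website to get the name and value of each element. In my Python Selenium script below, the main part of the web page is reproduced locally in html_content.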

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
# Every backslash is escaped, and the separator is added only once when joining.
dirinstall = "C:\\Program Files (x86)\\www\\mm"
chrome_driver = dirinstall + "\\Webdriver\\chromedriver.exe"
options = Options()
driver = webdriver.Chrome(chrome_driver, options=options)
html_content = """
<html class="ng-scope">
<head data-meta-tags="">
    <title> Stock NYSE </title>
    <ui-layout class="ng-isolate-scope">
        <div data-ng-include="" src="layoutCtrl.template" class="ng-scope">
            <app-root class="ng-scope" _nghost-rqp-c0="" ng-version="8.2.14"></app-root>
            <div ng-class="{'demo-mode': $root.session.user.portfolio.account.type === 'Demo' }" class="ng-scope">
                <div ng-view="" ng-class="layoutCtrl.isBannerShown ? 'banner-shown' : ''" class="main-app-view ng-scope" role="main">
                    <et-discovery-markets-results class="ng-scope" _nghost-rqp-c42="" ng-version="8.2.14">
                        <div _ngcontent-rqp-c42="" class="discover main-content no-footer" ui-fun-scroll="{'class': 'minimize', 'classEl': '.user-head-wrapper, .table-discover', 'scrollContainer': '.table-discover', 'setClassAtScroll': 200 }">
                            <div _ngcontent-rqp-c42="" automation-id="discover-market-results-wrapp" class="table-discover markets-table">
                                <et-discovery-markets-results-list _ngcontent-rqp-c42="" automation-id="discover-market-results-sub-view-list" _nghost-rqp-c44="" class="ng-star-inserted">
                                    <div _ngcontent-rqp-c44="" class="market-list list-view" data-etoro-locale-ns="discoverMarketResultsList">
                                        <et-instrument-mobile-row _ngcontent-rqp-c44="" automation-id="discover-market-results-row" _nghost-rqp-c18="" class="ng-star-inserted">
                                            <et-instrument-trading-mobile-row _ngcontent-rqp-c18="" automation-id="watchlist-grid-instruments-list" _nghost-rqp-c47="" class="ng-star-inserted">
                                                <div _ngcontent-rqp-c47="" class="row-wrap">
                                                    <div _ngcontent-rqp-c47="" automation-id="watchlist-item-list-wrapp-instrument" class="instrument-cell name-cell">
                                                        <div _ngcontent-rqp-c47="" class="avatar-img-wrap"> </div>
                                                        <div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-wrapp-instrument-info" class="avatar-info">
                                                            <div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-name" class="symbol">A</div>
                                                            <div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-full-name" class="name positive"> 0.68 (0.90%) </div>
                                                        </div>
                                                    </div>
                                                    <et-buy-sell-buttons _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-buy-sell-container" class="instrument-cell buy-sell-buttons" _nghost-rqp-c24="">
                                                        <et-buy-sell-button _ngcontent-rqp-c24="" _nghost-rqp-c27="">
                                                            <div _ngcontent-rqp-c27="" class="prices no-label positive-change" automation-id="buy-sell-button-container-sell">
                                                                <div _ngcontent-rqp-c27="" class="trade-button-title">S</div>
                                                                <div _ngcontent-rqp-c27="" automation-id="buy-sell-button-rate-value" class="price">75.<span class="after-decimal">85</span></div>
                                                            </div>
                                                        </et-buy-sell-button>
                                                        <div _ngcontent-rqp-c24="" class="space-gap"></div>
                                                        <et-buy-sell-button _ngcontent-rqp-c24="" _nghost-rqp-c27="">
                                                            <div _ngcontent-rqp-c27="" class="prices no-label negative-change" automation-id="buy-sell-button-container-buy">
                                                                <div _ngcontent-rqp-c27="" class="trade-button-title">B</div>
                                                                <div _ngcontent-rqp-c27="" automation-id="buy-sell-button-rate-value" class="price">76.<span class="after-decimal">03</span></div>
                                                            </div>
                                                        </et-buy-sell-button>
                                                    </et-buy-sell-buttons>
                                                </div>
                                                <et-trade-item-card-action _ngcontent-rqp-c18="" _nghost-rqp-c15="">
                                                </et-trade-item-card-action>
                                            </et-instrument-trading-mobile-row>
                                        </et-instrument-mobile-row>
                                        <et-instrument-mobile-row _ngcontent-rqp-c44="" automation-id="discover-market-results-row" _nghost-rqp-c18="" class="ng-star-inserted">
                                            <et-instrument-trading-mobile-row _ngcontent-rqp-c18="" automation-id="watchlist-grid-instruments-list" _nghost-rqp-c47="" class="ng-star-inserted">
                                                <div _ngcontent-rqp-c47="" class="row-wrap">
                                                    <div _ngcontent-rqp-c47="" automation-id="watchlist-item-list-wrapp-instrument" class="instrument-cell name-cell">
                                                        <div _ngcontent-rqp-c47="" class="avatar-img-wrap"> </div>
                                                        <div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-wrapp-instrument-info" class="avatar-info">
                                                            <div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-name" class="symbol">AA</div>
                                                            <div _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-full-name" class="name negative"> -0.11 (-1.46%) </div>
                                                        </div>
                                                    </div>
                                                    <et-buy-sell-buttons _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-buy-sell-container" class="instrument-cell buy-sell-buttons" _nghost-rqp-c24="">
                                                        <et-buy-sell-button _ngcontent-rqp-c24="" _nghost-rqp-c27="">
                                                            <div _ngcontent-rqp-c27="" class="prices no-label negative-change" automation-id="buy-sell-button-container-sell">
                                                                <div _ngcontent-rqp-c27="" class="trade-button-title">S</div>
                                                                <div _ngcontent-rqp-c27="" automation-id="buy-sell-button-rate-value" class="price">7.<span class="after-decimal">44</span></div>
                                                            </div>
                                                        </et-buy-sell-button>
                                                        <div _ngcontent-rqp-c24="" class="space-gap"></div>
                                                        <et-buy-sell-button _ngcontent-rqp-c24="" _nghost-rqp-c27="">
                                                            <div _ngcontent-rqp-c27="" class="prices no-label negative-change" automation-id="buy-sell-button-container-buy">
                                                                <div _ngcontent-rqp-c27="" class="trade-button-title">B</div>
                                                                <div _ngcontent-rqp-c27="" automation-id="buy-sell-button-rate-value" class="price">7.<span class="after-decimal">47</span></div>
                                                            </div>
                                                        </et-buy-sell-button>
                                                    </et-buy-sell-buttons>
                                                </div>
                                                <et-trade-item-card-action _ngcontent-rqp-c18="" _nghost-rqp-c15="">
                                                </et-trade-item-card-action>
                                            </et-instrument-trading-mobile-row>
                                        </et-instrument-mobile-row>
                                    </div>
                                </et-discovery-markets-results-list>
                            </div>
                        </div>
                    </et-discovery-markets-results>
                </div>
            </div>
        </div>
    </ui-layout>
    </body>

</html>
"""

driver.get("data:text/html;charset=utf-8,{html_content}".format(html_content=html_content))
#results = driver.find_elements_by_xpath("//*[@class='ng-star-inserted']")
results = driver.find_elements_by_xpath("//*[et-instrument-mobile-row and @class='ng-star-inserted']")
print('Number of results', len(results))

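I don't understand why I get only 1 element instead of 2 when I search for "et-instrument-mobile-row", and why I get 0 elements when I search for both "et-instrument-mobile-row" and "ng-star-inserted". Going by the sample above, my goal is to get each symbol together with its current sell and buy values (the integer part of the price and its decimals).

Something like: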

[A,75.85,76.03]

[AA,7.44,7.47]

Can anyone help me? Thanks!

1 Answer:

Answer 0: (score: 0)

You appear to have some malformed HTML, and Selenium may not be sure how to parse it. I noticed this line:

 <div _ngcontent-rqp-c47="" class="avatar-img-wrap"><img _ngcontent-rqp-c47="" automation-id="watchlist-item-grid-instrument-avatar" class="avatar-img" src="https://etoro-cdn.etorostatic.com/market-avatars/a/150x150.png" alt="Agilent Technologies Inc">

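The <img> tag is not closed. You will notice that the syntax highlighting gets confused here as well.

Otherwise, the XPath you are searching with looks generally well-formed.

Edit: Looking at it more closely, the element name belongs where the * is. A bare name inside the square brackets is a test for a child element, so //*[et-instrument-mobile-row and @class='ng-star-inserted'] matches any element that has an et-instrument-mobile-row child and itself carries that class, not the rows themselves. This is the XPath you want: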

"//et-instrument-mobile-row[@class='ng-star-inserted']"

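To see why the original expression returned 1 result and then 0, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree as a stand-in for Selenium. That substitution is an assumption made so the example runs without a browser; ElementTree supports only a subset of XPath, so the `and` is written as two chained predicates. The markup is a trimmed, hypothetical version of the page in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down version of the markup from the question.
SNIPPET = """
<body>
  <div class="market-list list-view">
    <et-instrument-mobile-row class="ng-star-inserted">A</et-instrument-mobile-row>
    <et-instrument-mobile-row class="ng-star-inserted">AA</et-instrument-mobile-row>
  </div>
</body>
"""

root = ET.fromstring(SNIPPET)

# A bare name inside [...] tests for a CHILD element, so this matches the
# parent div that contains the rows, not the rows themselves.
has_row_child = root.findall(".//*[et-instrument-mobile-row]")

# The class test is applied to that same parent div, whose class is different.
has_row_child_and_class = root.findall(
    ".//*[et-instrument-mobile-row][@class='ng-star-inserted']"
)

# Putting the element name where the * was selects the rows themselves.
rows = root.findall(".//et-instrument-mobile-row[@class='ng-star-inserted']")

print(len(has_row_child), len(has_row_child_and_class), len(rows))
```

The three counts correspond to the two searches described in the question and to the corrected XPath.
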
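Edit 2: The asker had a follow-up question about how to search within the elements found by the XPath above.

To find further elements inside those elements, look at the documentation: each Selenium WebElement provides its own find_element methods. You can use them to search further within the elements we just found. Be sure to start the XPath with .//, because you only want to traverse the contents of that particular element; an XPath that starts with // searches the whole document even when it is called on an element. The other find_element methods do not have this caveat.

Once you have located the elements that contain the symbol and the prices, you can simply read the text attribute of those elements. Let's look at a simple example: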

<div class="a">
  <div class="b" id="1">B</div>
  <div class="c" id="2">2</div>
  <div class="d" id="3">22</div>
</div>

Suppose we have already found the root div here and stored it in a variable named element. Then:

symbol = element.find_element_by_xpath(".//*[@class='b']").text
integral = element.find_element_by_xpath(".//*[@class='c']").text
fractional = element.find_element_by_xpath(".//*[@class='d']").text
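
The scoping behaviour of relative paths can be tried without a browser. The following is a rough sketch that again assumes xml.etree.ElementTree as a stand-in, with its find and findall playing the role of Selenium's find_element and find_elements. The second container and its values are hypothetical, added only to show that each lookup stays inside its own container.

```python
import xml.etree.ElementTree as ET

# Two hypothetical containers shaped like the example above.
DOC = """
<body>
  <div class="a">
    <div class="b" id="1">B</div>
    <div class="c" id="2">2</div>
    <div class="d" id="3">22</div>
  </div>
  <div class="a">
    <div class="b" id="4">C</div>
    <div class="c" id="5">3</div>
    <div class="d" id="6">33</div>
  </div>
</body>
"""

root = ET.fromstring(DOC)

records = []
for element in root.findall(".//div[@class='a']"):
    # The leading "." keeps each lookup inside the current container.
    symbol = element.find(".//*[@class='b']").text
    integral = element.find(".//*[@class='c']").text
    fractional = element.find(".//*[@class='d']").text
    records.append((symbol, integral, fractional))

print(records)
```

Each tuple holds the values from one container only, which is the behaviour the .// prefix gives you in Selenium as well.
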

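In general, if you can search by something other than XPath, it is easier for everyone involved. Here is a more typical way of doing the same thing using class names: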

symbol = element.find_element_by_class_name("b").text
integral = element.find_element_by_class_name("c").text
fractional = element.find_element_by_class_name("d").text

Edit 3: Note from the asker

With the valuable help of @firstbass, I dug deeper to get the symbol and the sell and buy prices, as follows:

for element in results:
    # Symbol of the instrument shown in this row, e.g. "A"
    symbol = element.find_element_by_xpath(".//*[@class='symbol']").text
    # .text on the price div includes the text of the nested after-decimal span,
    # so it is already the full price (e.g. "75.85") and needs no reassembly.
    sell = element.find_element_by_xpath(".//et-buy-sell-buttons//et-buy-sell-button//div[@automation-id='buy-sell-button-container-sell']")
    sell_price = sell.find_element_by_xpath(".//*[@class='price']").text
    buy = element.find_element_by_xpath(".//et-buy-sell-buttons//et-buy-sell-button//div[@automation-id='buy-sell-button-container-buy']")
    buy_price = buy.find_element_by_xpath(".//*[@class='price']").text
    print([symbol, sell_price, buy_price])
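
For anyone who wants to check the extraction logic without launching Chrome, here is a minimal sketch under stated assumptions: it uses xml.etree.ElementTree in place of Selenium, the markup is a simplified, hypothetical reduction of the rows in the question, and price_text and extract_rows are illustrative helper names, not part of any library. Unlike Selenium's .text, ElementTree's .text returns only the text before the first child element, so the helper joins all text pieces explicitly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified rows modeled on the page in the question.
PAGE = """
<body>
  <et-instrument-mobile-row class="ng-star-inserted">
    <div class="symbol">A</div>
    <div automation-id="buy-sell-button-container-sell">
      <div class="price">75.<span class="after-decimal">85</span></div>
    </div>
    <div automation-id="buy-sell-button-container-buy">
      <div class="price">76.<span class="after-decimal">03</span></div>
    </div>
  </et-instrument-mobile-row>
  <et-instrument-mobile-row class="ng-star-inserted">
    <div class="symbol">AA</div>
    <div automation-id="buy-sell-button-container-sell">
      <div class="price">7.<span class="after-decimal">44</span></div>
    </div>
    <div automation-id="buy-sell-button-container-buy">
      <div class="price">7.<span class="after-decimal">47</span></div>
    </div>
  </et-instrument-mobile-row>
</body>
"""


def price_text(container):
    """Join the price div's own text with its nested span, e.g. '75.' + '85'."""
    price = container.find(".//div[@class='price']")
    return "".join(price.itertext()).strip()


def extract_rows(root):
    """Return one [symbol, sell, buy] list per instrument row."""
    rows = []
    for row in root.findall(".//et-instrument-mobile-row[@class='ng-star-inserted']"):
        symbol = row.find(".//div[@class='symbol']").text.strip()
        sell = row.find(".//div[@automation-id='buy-sell-button-container-sell']")
        buy = row.find(".//div[@automation-id='buy-sell-button-container-buy']")
        rows.append([symbol, price_text(sell), price_text(buy)])
    return rows


rows = extract_rows(ET.fromstring(PAGE))
for row in rows:
    print(row)
```

The prices are kept as strings here; converting them with float or decimal.Decimal would be a separate design choice depending on what the values are used for.
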