Python XPath HTML scraping problem with dynamically created web page content

Time: 2016-01-18 01:56:09

Tags: python xpath web-scraping

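I'm completely new to programming, so please forgive me if I've made any silly mistakes. Below are my HTML parsing results, obtained with Python and lxml's xpath method. Unfortunately, I can't reach the exact parts of the site I'm interested in (the variables wiki3 through wiki8 return no elements, only an empty "[]"). The program only picks up the first element of the list, wiki2 = .../div[1], but neither its child, wiki3 = .../div[1]/a, nor any of its siblings, e.g. wiki4 = .../div[2], and I need all of these.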

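I think the problem may be related to the fact that the content I'm trying to access is created dynamically: only the first 12 of the list's roughly 8000 elements are shown, and further elements appear only when you scroll all the way to the bottom of the page (JavaScript seems to be responsible for this; see wiki7 in the HTML source).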
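One detail that fits this hypothesis: everything after "#" in a URL is a fragment, and HTTP clients do not send the fragment to the server. In the URL used in the session, all of the filter parameters sit behind "#", so a plain requests.get call asks the server only for the bare search page. A minimal standard-library sketch of how that URL splits (the URL is shortened to two tags here for readability):

```python
from urllib.parse import urlsplit

# The URL from the question, shortened to two tags for readability.
url = ("http://www.wikifolio.com/de/Invest/SearchWikifolio"
       "#/?tags=aktde,akteur&media=true")

parts = urlsplit(url)

# The path and query are what an HTTP client requests from the server.
print("path:    ", parts.path)
print("query:   ", repr(parts.query))
# The fragment stays on the client; only in-page JavaScript can read it.
print("fragment:", parts.fragment)
```

The query comes out empty because the "?" appears after the "#", which makes it part of the fragment.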

The material below has two parts: the first is the Python output of the program, and the second shows the part of the page's HTML I'm interested in.

Python script used for HTML parsing (for now only to check whether all parts of the site can be captured via xpath):

Python 3.5.1rc1 (v3.5.1rc1:948ef16a6951, Nov 22 2015, 23:41:41) [MSC v.1900 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import lxml
>>> import requests
>>> from lxml import html
>>> page=requests.get('http://www.wikifolio.com/de/Invest/SearchWikifolio#/?tags=aktde,akteur,aktusa,akthot,aktint,etf,fonds,anlagezert,hebel&media=true&private=true&assetmanager=true&theme=true&super=true&WithoutLeverageProductsOnly=true')
>>> tree=html.fromstring(page.content)
>>> wiki1=tree.xpath('/html/body/div[3]/div/div[2]/div[4]/form/div[2]/div/div[2]')

>>> wiki1
[<Element div at 0x5af6af1b88>]
>>> wiki2=tree.xpath('/html/body/div[3]/div/div[2]/div[4]/form/div[2]/div/div[2]/div[1]')

>>> wiki2
[<Element div at 0x5af6af1ea8>]
>>> wiki3=tree.xpath('/html/body/div[3]/div/div[2]/div[4]/form/div[2]/div/div[2]/div[1]/a')

>>> wiki3
[]
>>> wiki4=tree.xpath('/html/body/div[3]/div/div[2]/div[4]/form/div[2]/div/div[2]/div[2]')

>>> wiki4
[]
>>> wiki5=tree.xpath('/html/body/div[3]/div/div[2]/div[4]/form/div[2]/div/div[2]/div[19]')

>>> wiki5
[]
>>> wiki6=tree.xpath('/html/body/div[3]/div/div[2]/div[4]/form/div[2]/div/div[2]/div[37]')

>>> wiki6
[]
>>> wiki7=tree.xpath('/html/body/div[3]/div/div[2]/div[4]/form/div[2]/div/div[2]/a')

>>> wiki7
[]
>>> wiki8=tree.xpath('/html/body/div[3]/div/div[2]/div[4]/form/div[2]/div/div[2]/script[3]')

>>> wiki8
[]
>>> wiki9=tree.xpath('/html/body/div[3]/div/div[2]/div[4]/script')

>>> wiki9
[<Element script at 0x5af6af1ef8>]
>>> 
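In the session above, wiki2 matches but its child wiki3 and its sibling wiki4 do not, which suggests that the HTML received by requests differs from what the browser shows after its scripts have run. One way to check is to count the item elements in the raw response instead of relying on long absolute paths. The following is a minimal sketch using only the standard library; ClassCounter and count_class are hypothetical helper names, and the sample string is a made-up stand-in for page.text:

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute may be missing or valueless, hence the "or".
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html_text, class_name):
    """Return how many elements in html_text carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html_text)
    parser.close()
    return parser.count


# Stand-in for page.text; this sample is made up for illustration.
sample = """
<div class="search-result-content js-search-result-container">
  <div class="wikifolio_item" style="display: none;"></div>
  <script type="text/javascript">/* items are added here at runtime */</script>
</div>
"""
print(count_class(sample, "wikifolio_item"))
```

Running count_class(page.text, "wikifolio_item") on the real response would show how many items the server actually delivered. If that number is far below what the browser displays, the missing items are being added by JavaScript.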

HTML source (with the variable names added by hand between ** **):

**wiki1**   <div class="search-result-content js-search-result-container" style="">
**wiki2**       <div class="wikifolio_item" style="display: block;">
            <div class="actions">
**wiki3**           <a href="/de/SG1979">
            <h6>
            <a href="/de/SG1979">DACH-Trading&Invest</a>
            </h6>
            <div class="wikifolio-bar">
            <div class="search-result-detail-info">
            <div class="search-result-wikifolio-tags">
            </div>
**wiki4**       <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <script type="text/javascript">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
**wiki5**       <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <script type="text/javascript">
        <div id="guiSearchParams" data-directmatchurl="" data-params="{"searchTerm":null,"orderBy":"topwikis","realMoneyInvestorsOnly":false,"savingPlanOnly":false,"leverageProductsOnly":false,"withoutLeverageProductsOnly":true,"showPrivateWikifolios":true,"showMediaWikifolios":true,"showAssetManagerWikifolios":true,"showThemeWikifolios":true,"showSuperWikifolios":true,"austrianLicenseOnly":false,"swissLicenseOnly":false,"tagFilters":["aktde","akteur","aktusa","akthot","aktint","etf","fonds","anlagezert","hebel"],"rankingFilters":{}}"></div>
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
        <div class="wikifolio_item" style="display: block;">
**wiki6**       <div class="wikifolio_item" style="display: block;">
**wiki7**       <a id="loadMoreProjects" onclick="checkPositionBeforeLoading(); event.preventDefault ? event.preventDefault() : event.returnValue = false;" href="#">weitere Wikifolios anzeigen</a>
**wiki8**       <script type="text/javascript">
    </div>
    </div>
    </div>
    </form>
**wiki9**   <script type="text/javascript">
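The listing above contains a div with id "guiSearchParams" whose data-params attribute holds the current search settings as JSON. Reading it could help in working out what the page's script asks the server for when it loads more items, although the listing does not show which request that is. A minimal sketch using only the standard library; DataParamsFinder and extract_search_params are hypothetical helper names, and the sample is a shortened stand-in that assumes the inner quotes are escaped as &quot; in the raw HTML:

```python
import json
from html.parser import HTMLParser


class DataParamsFinder(HTMLParser):
    """Capture the data-params attribute of the element with a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.raw_params = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if attr_map.get("id") == self.element_id:
            # HTMLParser has already unescaped entities such as &quot;.
            self.raw_params = attr_map.get("data-params")


def extract_search_params(html_text, element_id="guiSearchParams"):
    """Return the decoded data-params JSON, or None if it is not present."""
    finder = DataParamsFinder(element_id)
    finder.feed(html_text)
    finder.close()
    if finder.raw_params is None:
        return None
    return json.loads(finder.raw_params)


# Shortened stand-in for the element in the listing above (assumed escaping).
sample = ('<div id="guiSearchParams" data-directmatchurl="" '
          'data-params="{&quot;orderBy&quot;:&quot;topwikis&quot;,'
          '&quot;withoutLeverageProductsOnly&quot;:true,'
          '&quot;tagFilters&quot;:[&quot;aktde&quot;,&quot;etf&quot;]}"></div>')

params = extract_search_params(sample)
print(params["orderBy"], params["tagFilters"])
```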

Does anyone know why only part of the site's content is accessible, and how to get around it?
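If the items really are added by JavaScript, one common way around it is to let a real browser render the page and then parse the rendered HTML. The sketch below assumes the third-party Selenium package with Firefox and its driver installed; scroll_until_stable and fetch_rendered_html are hypothetical helper names, and whether scrolling alone triggers loading on this particular site is an assumption based on the behaviour described above:

```python
import time

COUNT_JS = "return document.querySelectorAll(arguments[0]).length;"
SCROLL_JS = "window.scrollTo(0, document.body.scrollHeight);"


def scroll_until_stable(driver, selector, max_rounds=50, pause=1.0):
    """Scroll to the bottom until the number of matching elements stops growing.

    driver needs only an execute_script(script, *args) method, which
    Selenium WebDriver objects provide. Returns the final element count.
    """
    previous = -1
    for _ in range(max_rounds):
        current = driver.execute_script(COUNT_JS, selector)
        if current == previous:
            break  # nothing new appeared after the last scroll
        previous = current
        driver.execute_script(SCROLL_JS)
        time.sleep(pause)  # crude wait for the page's script to add items
    return driver.execute_script(COUNT_JS, selector)


def fetch_rendered_html(url, selector="div.wikifolio_item"):
    """Open url in a real browser, load lazily added items, return the HTML."""
    from selenium import webdriver  # third-party; imported only when called

    driver = webdriver.Firefox()  # assumes Firefox and its driver are installed
    try:
        driver.get(url)
        scroll_until_stable(driver, selector)
        return driver.page_source
    finally:
        driver.quit()
```

The string returned by fetch_rendered_html can be passed to html.fromstring just like page.content in the session above. With about 8000 items loaded 12 at a time, max_rounds would have to be raised considerably, and the fixed pause is fragile on slow connections, so treat this as a starting point only.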

I'd be eternally grateful to anyone who can help! Thanks :))

0 answers:

No answers