webscraping无法获取页面的源代码

时间:2018-11-12 06:04:40

标签: python web-scraping beautifulsoup request

我正在尝试刮擦https://onlineservices.ocswssw.org/Thinclient/Public/PR/EN 下面是代码。

import requests
from bs4 import BeautifulSoup as BS

sess = requests.session()
html = sess.get(url,headers={'User-Agent': 'Mozilla/5.0'},allow_redirects=True)
Soup = BS(html.text,'lxml')
with open('ocswssw.html,'w') as f:
    print(Soup.prettify())

如果您比较ocswssw.html和chrome网站。他们不匹配。

但是我收到的源代码有些不完整。请让我知道出了什么问题。

我不喜欢在浏览器弹出窗口中使用硒。

3 个答案:

答案 0 :(得分:0)

对于您最终要完成的工作,我还不太清楚,但是在收到信息源时,我会:

1) 使用open()方法和

为您的 ocswssw.html 参数添加了缺少的撇号

2) 运行该代码,并获得与Google Chrome浏览器几乎相同的来源。

BS的结果

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   OCSWSSW | Member Search
  </title>
  <link href="/Thinclient/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <link href="/Thinclient/Content/bootstrap.071220161413.css" rel="stylesheet" type="text/css"/>
  <link href="/Thinclient/Content/kendo/kendo.common-bootstrap.min.css" rel="stylesheet"/>
  <link href="/Thinclient/Content/kendo/kendo.bootstrap.min.css" rel="stylesheet"/>
  <link href="/Thinclient/Content/ThinStyle.110820150951.css" rel="stylesheet" title="Blue" type="text/css"/>
  <link href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css" rel="stylesheet"/>
  <link href="/Thinclient/Content/icheck/square/blue.css" rel="stylesheet"/>
  <link href="/Thinclient/Content/GlobalStyleSheet.css" rel="stylesheet"/>
  <script type="text/javascript">
   HomeURL = "#/forms/new/?table=0x800000000000003D&amp;form=0x800000000000004D&amp;command=0x8000000000000C2D";
        AfterLoginData = null

        LanguageDictionary = {};

        LanguageDictionary.TC_COMMON = {"OkButtonTextOK":"Ok","OkButtonTextContinue":"Continue","OkButtonTextYes":"Yes","OkButtonTextDelete":"Clear","CancelButtonTextCancel":"Cancel","CancelButtonTextNo":"No","CancelButtonTextLogout":"Logout","MiddleButtonTextNo":"No","AjaxRequestError":"The Web server does not respond currently. Please try again later.","UserIdleMessage":"You are innactive, do you want to continue or you disconnect?","ErrorTitle":"Error","ErrorHeaderTitle":"Application error","ErrorHeaderText":"An application error has occurred while processing the current request. The error was recorded and sent to the site administrator. Provide your administrator ID error below.","ErrorMessage":"Message:","ErrorIdentifier":"Identify:","ErrorDate":"Date:"}

        LanguageDictionary.TC_SEARCH = {"OperatorNotEqual":"Not =","OperatorIsDefined":"Is Defined","OperatorIsNotDefined":"Is Not Defined","OperatorContains":"Contains","OperatorDoesNotContain":"Does not contain","OperatorBeginsWith":"Begins with","OperatorDoesNotBeginWith":"Does not begin with","OperatorIsEmpty":"Is Empty","OperatorIsNotEmpty":"Is not empty","CustomFiltersNotComplete":"One or more custom filters are not complete. Examine each custom filter and make sure that the valid search criteria are provided.","NavigateAwayFromSearchWithFilterSet":"You are about to leave this page without performing the search filters custom.","NoGlobalSearchPermissions":"Password","SearchDefinitionLostAlert":"The definition of research will be lost if the primary table is changed. Are you sure you want to change the primary table of the research."}

        LanguageDictionary.TC_FORM = {"RequiredFieldsNotSet":"Unable to save the form data. Provide a value for all required fields.","NavigateAwayFromUnsavedForm":"You are about to exit the form without saving it","RefreshFormLosesModifiedData":"The data of the form has changed. The changes you made will be lost when you refresh the form. Do you want to continue?","SaveDataBeforeClose":"The data of the form has changed. Do you want to save them before closing?","DeleteWarning":"The form data will be deleted. Are you sure you want to continue?","DeleteSecondaryWarning":"You are about to delete the form data.","RequiredField":"This is a required field","InvalidFormat":"The format for this field is not valid"}

        LanguageDictionary.TC_GLOBALSEARCH = {"CollapseAllLabel":"Reduce everything","ExpandAllLabel":"About expand"}

        LanguageDictionary.TC_WIDGETS = {"CallListItem":"Appeal","FaxListItem":"Fax","SmsListItem":"SMS"}
  </script>
  <script src="/Thinclient/Scripts/jquery-1.11.1.min.js" type="text/javascript">
  </script>
  <script src="/Thinclient/Scripts/jquery-migrate-1.2.1.min.js" type="text/javascript">
  </script>
  <script src="/Thinclient/Scripts/icheck.min.js">
  </script>
  <script src="/Thinclient/Scripts/kendo/kendo.all.min.js">
  </script>
  <script src="/Thinclient/Scripts/kendo/kendo.timezones.min.js">
  </script>
  <script src="/Thinclient/Scripts/kendo/kendo.aspnetmvc.min.js">
  </script>
  <script src="/Thinclient/Scripts/kendo/cultures/kendo.culture.en-US.min.js">
  </script>
  <script>
   kendo.culture("en-US");
  </script>
 </head>
 <body class="k-content">
  <div class="k-loading-mask" id="loadingMsg" style="width:100%;height:100%">
   <span class="k-loading-text">
    Loading...
   </span>
   <div class="k-loading-image">
    <div class="k-loading-color">
    </div>
   </div>
  </div>
  <input id="hdPollingFrequency" type="hidden" value="32767"/>
  <input id="hdPrivateComputerTimeout" type="hidden" value="32767"/>
  <input id="hdPublicComputerTimeout" type="hidden" value="32767"/>
  <input id="hdWarningDisplayDuration" type="hidden" value="0"/>
  <input id="hdWindowsAuthentication" type="hidden" value="false"/>
  <div class="container">
   <div id="content">
   </div>
  </div>
  <div id="loading" style="display: none;">
   <h1>
    We are processing your request. Please be patient.
   </h1>
   <input class="abortButton" type="button" value="Abort"/>
  </div>
  <script id="taskpadGroupTmpl" type="text/x-jquery-tmpl">
   <div class="panelBlock">
            <div class="panelTitle"><div class="panelLink"><a class="panelDD-dn" id="${DisplayName}" href="#">${DisplayName}</a></div><div class = "imgPanel">
            <a class="imgPanelDD" href="#">&nbsp;</a></div>
            </div>
                    <div class="panelContent1" id="panelContent1 + ${DisplayName}">
                        <ul>
                            {{tmpl(TaskItemCollection) "#taskpadItemTmpl"}}
                        </ul>                   
                    </div>            
            </div>
  </script>
  <script id="KendoTestTemplate" type="text/x-kendo-template">
   <h2>#= test #</h2>
        <ul>
                        #= kendo.render(kendo.template($("\\#KendoTestLiTemplate").html()), litest) #
        </ul>
  </script>
  <script id="KendoTestLiTemplate" type="text/x-kendo-template">
   <li>#= displayName#</li>
  </script>
  <script id="ErrorTemplate" type="text/x-jquery-tmpl">
   <div class="errorMsg k-widget k-notification k-notification-error " data-role="alert" style="display: block; opacity: 1;">
            <div class="k-notification-wrap">
                <span class="k-icon k-i-note">
                    error
                </span>
                ${errorMsg}
                <span class="k-icon k-i-close">
                    Hide
                </span>
            </div>
        </div>
  </script>
  <script id="HelpButtonTemplate" type="text/x-jquery-tmpl">
   <button class="k-button k-primary helpButton" id="${id}" onclick="return false;">?</button>
  </script>
  <script id="IconTemplate" type="text/x-jquery-tmpl">
   <span class="k-icon ${icon}"></span>
  </script>
  <script id="trash" type="text/x-kendo-template">
   <li style="background: url(./Images/#=item.ImageId#.#=item.ImageHash#.#=item.ImageFileExtension#) no-repeat;"><a href="#=item.ActionCommand#" #if (item.ShowInNewWindow){# target="_blank" #}# class="#if (!item.ShowInNewWindow){# ajax-links #} if (item.ContentType == 'Email'){# mailto-links #}# linkTaskItem">#=item.DisplayName#</a></li>
    <li class="#=GetCssClass(item.ContentType)#"><a href="#=item.ActionCommand#" #if (item.ShowInNewWindow){# target="_blank" #}# class="#if (!item.ShowInNewWindow){# ajax-links #} if (item.ContentType == 'Email'){# mailto-links #}# linkTaskItem">#=item.DisplayName#</a></li>
  </script>
  <script id="taskpadItemTmpl" type="text/x-jquery-tmpl">
   {{if ImageId}}
            <li style="background: url(./Images/${ImageId}.${ImageHash}.${ImageFileExtension}) no-repeat;"><a href="${ActionCommand}" {{if ShowInNewWindow}} target="_blank" {{/if}} class="{{if !ShowInNewWindow}} ajax-links {{/if}} {{if (ContentType == 'Email')}} mailto-links {{/if}} linkTaskItem">${DisplayName}</a></li>
      {{else}}
        <li class="${GetCssClass(ContentType)}"><a href="${ActionCommand}" {{if ShowInNewWindow}} target="_blank" {{/if}} class="{{if !ShowInNewWindow}} ajax-links {{/if}} {{if (ContentType == 'Email')}} mailto-links {{/if}} linkTaskItem">${DisplayName}</a></li>
       {{/if}}
  </script>
  <script id="buttonBarButtonTmpl" type="text/x-jquery-tmpl">
   <button value="submit" class="submitBtn k-button k-primary" data-actionCommand="${Action}" data-Disabled="${Disabled}" data-Visible="${Visible}" data-Name="${Name}">
                            <span>${DisplayName}</span>
        </button>
  </script>
  <script src="/Thinclient/Scripts/jquery.filedownload.150420151637.js" type="text/javascript">
  </script>
  <script src="/Thinclient/Scripts/jquery.tmpl.min.150420151637.js" type="text/javascript">
  </script>
  <script src="/Thinclient/Scripts/pubsub.150420151641.js" type="text/javascript">
  </script>
  <script src="/Thinclient/Scripts/jquery.form.150420151637.js" type="text/javascript">
  </script>
  <script src="/Thinclient/Scripts/bootstrap.min.150420151637.js" type="text/javascript">
  </script>
  <script src="/Thinclient/Scripts/sameheight.min.150420151641.js" type="text/javascript">
  </script>
  <script src="/Thinclient/Scripts/Core.141120161617.js" type="text/javascript">
  </script>
  <script src="/Thinclient/Scripts/PivotalThinClient.150420151641.js" type="text/javascript">
  </script>
 </body>
</html>

来自浏览器源的结果

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>OCSWSSW | Member Search</title>
    <link href="/Thinclient/favicon.ico" type="image/x-icon" rel="shortcut icon" />
    <link href="/Thinclient/Content/bootstrap.071220161413.css" rel="stylesheet" type="text/css" />
    <link rel="stylesheet" href="/Thinclient/Content/kendo/kendo.common-bootstrap.min.css" />
    <link rel="stylesheet" href="/Thinclient/Content/kendo/kendo.bootstrap.min.css" />
    <link href="/Thinclient/Content/ThinStyle.110820150951.css" rel="stylesheet" title="Blue" type="text/css" />
    <link rel="stylesheet" href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css">    
    <link rel="stylesheet" href="/Thinclient/Content/icheck/square/blue.css" />

            <link rel="stylesheet" href="/Thinclient/Content/GlobalStyleSheet.css" />


    <script type="text/javascript" >
        HomeURL = "#/forms/new/?table=0x800000000000003D&amp;form=0x800000000000004D&amp;command=0x8000000000000C2D";
        AfterLoginData = null

        LanguageDictionary = {};

        LanguageDictionary.TC_COMMON = {"OkButtonTextOK":"Ok","OkButtonTextContinue":"Continue","OkButtonTextYes":"Yes","OkButtonTextDelete":"Clear","CancelButtonTextCancel":"Cancel","CancelButtonTextNo":"No","CancelButtonTextLogout":"Logout","MiddleButtonTextNo":"No","AjaxRequestError":"The Web server does not respond currently. Please try again later.","UserIdleMessage":"You are innactive, do you want to continue or you disconnect?","ErrorTitle":"Error","ErrorHeaderTitle":"Application error","ErrorHeaderText":"An application error has occurred while processing the current request. The error was recorded and sent to the site administrator. Provide your administrator ID error below.","ErrorMessage":"Message:","ErrorIdentifier":"Identify:","ErrorDate":"Date:"}

        LanguageDictionary.TC_SEARCH = {"OperatorNotEqual":"Not =","OperatorIsDefined":"Is Defined","OperatorIsNotDefined":"Is Not Defined","OperatorContains":"Contains","OperatorDoesNotContain":"Does not contain","OperatorBeginsWith":"Begins with","OperatorDoesNotBeginWith":"Does not begin with","OperatorIsEmpty":"Is Empty","OperatorIsNotEmpty":"Is not empty","CustomFiltersNotComplete":"One or more custom filters are not complete. Examine each custom filter and make sure that the valid search criteria are provided.","NavigateAwayFromSearchWithFilterSet":"You are about to leave this page without performing the search filters custom.","NoGlobalSearchPermissions":"Password","SearchDefinitionLostAlert":"The definition of research will be lost if the primary table is changed. Are you sure you want to change the primary table of the research."}

        LanguageDictionary.TC_FORM = {"RequiredFieldsNotSet":"Unable to save the form data. Provide a value for all required fields.","NavigateAwayFromUnsavedForm":"You are about to exit the form without saving it","RefreshFormLosesModifiedData":"The data of the form has changed. The changes you made will be lost when you refresh the form. Do you want to continue?","SaveDataBeforeClose":"The data of the form has changed. Do you want to save them before closing?","DeleteWarning":"The form data will be deleted. Are you sure you want to continue?","DeleteSecondaryWarning":"You are about to delete the form data.","RequiredField":"This is a required field","InvalidFormat":"The format for this field is not valid"}

        LanguageDictionary.TC_GLOBALSEARCH = {"CollapseAllLabel":"Reduce everything","ExpandAllLabel":"About expand"}

        LanguageDictionary.TC_WIDGETS = {"CallListItem":"Appeal","FaxListItem":"Fax","SmsListItem":"SMS"}
    </script>
    <script type="text/javascript" src="/Thinclient/Scripts/jquery-1.11.1.min.js"></script>
    <script src="/Thinclient/Scripts/jquery-migrate-1.2.1.min.js" type="text/javascript"></script>

    <script src="/Thinclient/Scripts/icheck.min.js"></script>
    <script src="/Thinclient/Scripts/kendo/kendo.all.min.js"></script>
    <script src="/Thinclient/Scripts/kendo/kendo.timezones.min.js"></script>
    <script src="/Thinclient/Scripts/kendo/kendo.aspnetmvc.min.js"></script>

    <script src="/Thinclient/Scripts/kendo/cultures/kendo.culture.en-US.min.js"></script>
    <script>
        kendo.culture("en-US");
</script>




</head>
<body class="k-content">
    <div id="loadingMsg" class="k-loading-mask" style="width:100%;height:100%">
        <span class="k-loading-text">Loading...</span>
        <div class="k-loading-image">
            <div class="k-loading-color"></div>
        </div>
    </div>

    <input type="hidden" ID="hdPollingFrequency" value="32767"/>
    <input type="hidden" ID="hdPrivateComputerTimeout" value= "32767"/>
    <input type="hidden" ID="hdPublicComputerTimeout" value="32767"/>
    <input type="hidden" ID="hdWarningDisplayDuration" value="0"/>
    <input type="hidden" id="hdWindowsAuthentication" value="false"/>



<div class="container">
    <div id="content">

    </div>
</div>

    <div id="loading" style="display: none;">
        <h1>
            We are processing your request. Please be patient.</h1>
        <input type="button" value="Abort" class="abortButton" />
    </div>
    <script id="taskpadGroupTmpl" type="text/x-jquery-tmpl">
        <div class="panelBlock">
            <div class="panelTitle"><div class="panelLink"><a class="panelDD-dn" id="${DisplayName}" href="#">${DisplayName}</a></div><div class = "imgPanel">
            <a class="imgPanelDD" href="#">&nbsp;</a></div>
            </div>
                    <div class="panelContent1" id="panelContent1 + ${DisplayName}">
                        <ul>
                            {{tmpl(TaskItemCollection) "#taskpadItemTmpl"}}
                        </ul>                   
                    </div>            
            </div>      
    </script>
    <script id="KendoTestTemplate" type="text/x-kendo-template">
        <h2>#= test #</h2>
        <ul>
                        #= kendo.render(kendo.template($("\\#KendoTestLiTemplate").html()), litest) #
        </ul>   
    </script>
    <script id="KendoTestLiTemplate" type="text/x-kendo-template">
        <li>#= displayName#</li>    
    </script>
    <script id="ErrorTemplate" type="text/x-jquery-tmpl">
        <div class="errorMsg k-widget k-notification k-notification-error " data-role="alert" style="display: block; opacity: 1;">
            <div class="k-notification-wrap">
                <span class="k-icon k-i-note">
                    error
                </span>
                ${errorMsg}
                <span class="k-icon k-i-close">
                    Hide
                </span>
            </div>
        </div>
    </script>
    <script id="HelpButtonTemplate" type="text/x-jquery-tmpl">
        <button class="k-button k-primary helpButton" id="${id}" onclick="return false;">?</button>
    </script>
    <script id="IconTemplate" type="text/x-jquery-tmpl">
        <span class="k-icon ${icon}"></span>
    </script>
    <script id="trash" type="text/x-kendo-template">
    <li style="background: url(./Images/#=item.ImageId#.#=item.ImageHash#.#=item.ImageFileExtension#) no-repeat;"><a href="#=item.ActionCommand#" #if (item.ShowInNewWindow){# target="_blank" #}# class="#if (!item.ShowInNewWindow){# ajax-links #} if (item.ContentType == 'Email'){# mailto-links #}# linkTaskItem">#=item.DisplayName#</a></li>
    <li class="#=GetCssClass(item.ContentType)#"><a href="#=item.ActionCommand#" #if (item.ShowInNewWindow){# target="_blank" #}# class="#if (!item.ShowInNewWindow){# ajax-links #} if (item.ContentType == 'Email'){# mailto-links #}# linkTaskItem">#=item.DisplayName#</a></li>

    </script>
    <script id="taskpadItemTmpl" type="text/x-jquery-tmpl">
     {{if ImageId}}
            <li style="background: url(./Images/${ImageId}.${ImageHash}.${ImageFileExtension}) no-repeat;"><a href="${ActionCommand}" {{if ShowInNewWindow}} target="_blank" {{/if}} class="{{if !ShowInNewWindow}} ajax-links {{/if}} {{if (ContentType == 'Email')}} mailto-links {{/if}} linkTaskItem">${DisplayName}</a></li>
      {{else}}
        <li class="${GetCssClass(ContentType)}"><a href="${ActionCommand}" {{if ShowInNewWindow}} target="_blank" {{/if}} class="{{if !ShowInNewWindow}} ajax-links {{/if}} {{if (ContentType == 'Email')}} mailto-links {{/if}} linkTaskItem">${DisplayName}</a></li>
       {{/if}}
    </script>
    <script id="buttonBarButtonTmpl" type="text/x-jquery-tmpl">
        <button value="submit" class="submitBtn k-button k-primary" data-actionCommand="${Action}" data-Disabled="${Disabled}" data-Visible="${Visible}" data-Name="${Name}">
                            <span>${DisplayName}</span>
        </button>
    </script>
    <script src="/Thinclient/Scripts/jquery.filedownload.150420151637.js" type="text/javascript"></script>      
    <script src="/Thinclient/Scripts/jquery.tmpl.min.150420151637.js" type="text/javascript"></script>  
    <script src="/Thinclient/Scripts/pubsub.150420151641.js" type="text/javascript"></script>
    <script src="/Thinclient/Scripts/jquery.form.150420151637.js" type="text/javascript"></script>
    <script src="/Thinclient/Scripts/bootstrap.min.150420151637.js" type="text/javascript"></script>

    <script src="/Thinclient/Scripts/sameheight.min.150420151641.js" type="text/javascript"></script>
    <script src="/Thinclient/Scripts/Core.141120161617.js" type="text/javascript"></script>










    <script src="/Thinclient/Scripts/PivotalThinClient.150420151641.js" type="text/javascript"></script>




</body>
</html>


<script>
    //$( window ).load( 
    //$(".k-state-default").hover(function () {
    //    $(this).toggleClass("k-state-hover");
    //})
    //);
</script>

这不是您在Beautiful Soup中想要的东西吗?

答案 1 :(得分:0)

该页面是使用javascript创建的。 因此,仅使用request / bs4

不能获得页面源

如何隐瞒:使用HeadlessChrome创建由javascript创建的页面源

答案 2 :(得分:0)

它不是动态页面(Ajax),您不能使用bs4,如果您不喜欢硒元素,可以在浏览器弹出窗口中添加--headless选项以将其隐藏。这里的例子

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument('--headless')
#options.add_argument('--disable-gpu')  # maybe needed if running on Windows.
driver = webdriver.Chrome(chrome_options=options)

print("Loading Page...")
driver.get('https://onlineservices.ocswssw.org/Thinclient/Public/PR/EN/')

# wait max 20 second until ajax content rendered
print("Wait Ajax finished...")
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID , 'MainForm')))

html = driver.execute_script("return document.documentElement.outerHTML")
Soup = BeautifulSoup(html, 'html.parser')
with open('ocswssw.html', 'w') as f:
    sourceCode = Soup.prettify().encode('utf-8')
    f.write(sourceCode)
    print(sourceCode)

driver.quit()