我正在尝试刮擦https://onlineservices.ocswssw.org/Thinclient/Public/PR/EN 下面是代码。
import requests
from bs4 import BeautifulSoup as BS
sess = requests.session()
html = sess.get(url,headers={'User-Agent': 'Mozilla/5.0'},allow_redirects=True)
Soup = BS(html.text,'lxml')
with open('ocswssw.html,'w') as f:
print(Soup.prettify())
如果您比较ocswssw.html
和chrome网站。他们不匹配。
但是我收到的源代码有些不完整。请让我知道出了什么问题。
我不喜欢在浏览器弹出窗口中使用硒。
答案 0 :(得分:0)
对于您最终要完成的工作,我还不太清楚,但是在收到信息源时,我会:
1) 使用open()方法和
为您的 ocswssw.html 参数添加了缺少的撇号2) 运行该代码,并获得与Google Chrome浏览器几乎相同的来源。
BS的结果
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>
OCSWSSW | Member Search
</title>
<link href="/Thinclient/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="/Thinclient/Content/bootstrap.071220161413.css" rel="stylesheet" type="text/css"/>
<link href="/Thinclient/Content/kendo/kendo.common-bootstrap.min.css" rel="stylesheet"/>
<link href="/Thinclient/Content/kendo/kendo.bootstrap.min.css" rel="stylesheet"/>
<link href="/Thinclient/Content/ThinStyle.110820150951.css" rel="stylesheet" title="Blue" type="text/css"/>
<link href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css" rel="stylesheet"/>
<link href="/Thinclient/Content/icheck/square/blue.css" rel="stylesheet"/>
<link href="/Thinclient/Content/GlobalStyleSheet.css" rel="stylesheet"/>
<script type="text/javascript">
HomeURL = "#/forms/new/?table=0x800000000000003D&form=0x800000000000004D&command=0x8000000000000C2D";
AfterLoginData = null
LanguageDictionary = {};
LanguageDictionary.TC_COMMON = {"OkButtonTextOK":"Ok","OkButtonTextContinue":"Continue","OkButtonTextYes":"Yes","OkButtonTextDelete":"Clear","CancelButtonTextCancel":"Cancel","CancelButtonTextNo":"No","CancelButtonTextLogout":"Logout","MiddleButtonTextNo":"No","AjaxRequestError":"The Web server does not respond currently. Please try again later.","UserIdleMessage":"You are innactive, do you want to continue or you disconnect?","ErrorTitle":"Error","ErrorHeaderTitle":"Application error","ErrorHeaderText":"An application error has occurred while processing the current request. The error was recorded and sent to the site administrator. Provide your administrator ID error below.","ErrorMessage":"Message:","ErrorIdentifier":"Identify:","ErrorDate":"Date:"}
LanguageDictionary.TC_SEARCH = {"OperatorNotEqual":"Not =","OperatorIsDefined":"Is Defined","OperatorIsNotDefined":"Is Not Defined","OperatorContains":"Contains","OperatorDoesNotContain":"Does not contain","OperatorBeginsWith":"Begins with","OperatorDoesNotBeginWith":"Does not begin with","OperatorIsEmpty":"Is Empty","OperatorIsNotEmpty":"Is not empty","CustomFiltersNotComplete":"One or more custom filters are not complete. Examine each custom filter and make sure that the valid search criteria are provided.","NavigateAwayFromSearchWithFilterSet":"You are about to leave this page without performing the search filters custom.","NoGlobalSearchPermissions":"Password","SearchDefinitionLostAlert":"The definition of research will be lost if the primary table is changed. Are you sure you want to change the primary table of the research."}
LanguageDictionary.TC_FORM = {"RequiredFieldsNotSet":"Unable to save the form data. Provide a value for all required fields.","NavigateAwayFromUnsavedForm":"You are about to exit the form without saving it","RefreshFormLosesModifiedData":"The data of the form has changed. The changes you made will be lost when you refresh the form. Do you want to continue?","SaveDataBeforeClose":"The data of the form has changed. Do you want to save them before closing?","DeleteWarning":"The form data will be deleted. Are you sure you want to continue?","DeleteSecondaryWarning":"You are about to delete the form data.","RequiredField":"This is a required field","InvalidFormat":"The format for this field is not valid"}
LanguageDictionary.TC_GLOBALSEARCH = {"CollapseAllLabel":"Reduce everything","ExpandAllLabel":"About expand"}
LanguageDictionary.TC_WIDGETS = {"CallListItem":"Appeal","FaxListItem":"Fax","SmsListItem":"SMS"}
</script>
<script src="/Thinclient/Scripts/jquery-1.11.1.min.js" type="text/javascript">
</script>
<script src="/Thinclient/Scripts/jquery-migrate-1.2.1.min.js" type="text/javascript">
</script>
<script src="/Thinclient/Scripts/icheck.min.js">
</script>
<script src="/Thinclient/Scripts/kendo/kendo.all.min.js">
</script>
<script src="/Thinclient/Scripts/kendo/kendo.timezones.min.js">
</script>
<script src="/Thinclient/Scripts/kendo/kendo.aspnetmvc.min.js">
</script>
<script src="/Thinclient/Scripts/kendo/cultures/kendo.culture.en-US.min.js">
</script>
<script>
kendo.culture("en-US");
</script>
</head>
<body class="k-content">
<div class="k-loading-mask" id="loadingMsg" style="width:100%;height:100%">
<span class="k-loading-text">
Loading...
</span>
<div class="k-loading-image">
<div class="k-loading-color">
</div>
</div>
</div>
<input id="hdPollingFrequency" type="hidden" value="32767"/>
<input id="hdPrivateComputerTimeout" type="hidden" value="32767"/>
<input id="hdPublicComputerTimeout" type="hidden" value="32767"/>
<input id="hdWarningDisplayDuration" type="hidden" value="0"/>
<input id="hdWindowsAuthentication" type="hidden" value="false"/>
<div class="container">
<div id="content">
</div>
</div>
<div id="loading" style="display: none;">
<h1>
We are processing your request. Please be patient.
</h1>
<input class="abortButton" type="button" value="Abort"/>
</div>
<script id="taskpadGroupTmpl" type="text/x-jquery-tmpl">
<div class="panelBlock">
<div class="panelTitle"><div class="panelLink"><a class="panelDD-dn" id="${DisplayName}" href="#">${DisplayName}</a></div><div class = "imgPanel">
<a class="imgPanelDD" href="#"> </a></div>
</div>
<div class="panelContent1" id="panelContent1 + ${DisplayName}">
<ul>
{{tmpl(TaskItemCollection) "#taskpadItemTmpl"}}
</ul>
</div>
</div>
</script>
<script id="KendoTestTemplate" type="text/x-kendo-template">
<h2>#= test #</h2>
<ul>
#= kendo.render(kendo.template($("\\#KendoTestLiTemplate").html()), litest) #
</ul>
</script>
<script id="KendoTestLiTemplate" type="text/x-kendo-template">
<li>#= displayName#</li>
</script>
<script id="ErrorTemplate" type="text/x-jquery-tmpl">
<div class="errorMsg k-widget k-notification k-notification-error " data-role="alert" style="display: block; opacity: 1;">
<div class="k-notification-wrap">
<span class="k-icon k-i-note">
error
</span>
${errorMsg}
<span class="k-icon k-i-close">
Hide
</span>
</div>
</div>
</script>
<script id="HelpButtonTemplate" type="text/x-jquery-tmpl">
<button class="k-button k-primary helpButton" id="${id}" onclick="return false;">?</button>
</script>
<script id="IconTemplate" type="text/x-jquery-tmpl">
<span class="k-icon ${icon}"></span>
</script>
<script id="trash" type="text/x-kendo-template">
<li style="background: url(./Images/#=item.ImageId#.#=item.ImageHash#.#=item.ImageFileExtension#) no-repeat;"><a href="#=item.ActionCommand#" #if (item.ShowInNewWindow){# target="_blank" #}# class="#if (!item.ShowInNewWindow){# ajax-links #} if (item.ContentType == 'Email'){# mailto-links #}# linkTaskItem">#=item.DisplayName#</a></li>
<li class="#=GetCssClass(item.ContentType)#"><a href="#=item.ActionCommand#" #if (item.ShowInNewWindow){# target="_blank" #}# class="#if (!item.ShowInNewWindow){# ajax-links #} if (item.ContentType == 'Email'){# mailto-links #}# linkTaskItem">#=item.DisplayName#</a></li>
</script>
<script id="taskpadItemTmpl" type="text/x-jquery-tmpl">
{{if ImageId}}
<li style="background: url(./Images/${ImageId}.${ImageHash}.${ImageFileExtension}) no-repeat;"><a href="${ActionCommand}" {{if ShowInNewWindow}} target="_blank" {{/if}} class="{{if !ShowInNewWindow}} ajax-links {{/if}} {{if (ContentType == 'Email')}} mailto-links {{/if}} linkTaskItem">${DisplayName}</a></li>
{{else}}
<li class="${GetCssClass(ContentType)}"><a href="${ActionCommand}" {{if ShowInNewWindow}} target="_blank" {{/if}} class="{{if !ShowInNewWindow}} ajax-links {{/if}} {{if (ContentType == 'Email')}} mailto-links {{/if}} linkTaskItem">${DisplayName}</a></li>
{{/if}}
</script>
<script id="buttonBarButtonTmpl" type="text/x-jquery-tmpl">
<button value="submit" class="submitBtn k-button k-primary" data-actionCommand="${Action}" data-Disabled="${Disabled}" data-Visible="${Visible}" data-Name="${Name}">
<span>${DisplayName}</span>
</button>
</script>
<script src="/Thinclient/Scripts/jquery.filedownload.150420151637.js" type="text/javascript">
</script>
<script src="/Thinclient/Scripts/jquery.tmpl.min.150420151637.js" type="text/javascript">
</script>
<script src="/Thinclient/Scripts/pubsub.150420151641.js" type="text/javascript">
</script>
<script src="/Thinclient/Scripts/jquery.form.150420151637.js" type="text/javascript">
</script>
<script src="/Thinclient/Scripts/bootstrap.min.150420151637.js" type="text/javascript">
</script>
<script src="/Thinclient/Scripts/sameheight.min.150420151641.js" type="text/javascript">
</script>
<script src="/Thinclient/Scripts/Core.141120161617.js" type="text/javascript">
</script>
<script src="/Thinclient/Scripts/PivotalThinClient.150420151641.js" type="text/javascript">
</script>
</body>
</html>
来自浏览器源的结果
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>OCSWSSW | Member Search</title>
<link href="/Thinclient/favicon.ico" type="image/x-icon" rel="shortcut icon" />
<link href="/Thinclient/Content/bootstrap.071220161413.css" rel="stylesheet" type="text/css" />
<link rel="stylesheet" href="/Thinclient/Content/kendo/kendo.common-bootstrap.min.css" />
<link rel="stylesheet" href="/Thinclient/Content/kendo/kendo.bootstrap.min.css" />
<link href="/Thinclient/Content/ThinStyle.110820150951.css" rel="stylesheet" title="Blue" type="text/css" />
<link rel="stylesheet" href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css">
<link rel="stylesheet" href="/Thinclient/Content/icheck/square/blue.css" />
<link rel="stylesheet" href="/Thinclient/Content/GlobalStyleSheet.css" />
<script type="text/javascript" >
HomeURL = "#/forms/new/?table=0x800000000000003D&form=0x800000000000004D&command=0x8000000000000C2D";
AfterLoginData = null
LanguageDictionary = {};
LanguageDictionary.TC_COMMON = {"OkButtonTextOK":"Ok","OkButtonTextContinue":"Continue","OkButtonTextYes":"Yes","OkButtonTextDelete":"Clear","CancelButtonTextCancel":"Cancel","CancelButtonTextNo":"No","CancelButtonTextLogout":"Logout","MiddleButtonTextNo":"No","AjaxRequestError":"The Web server does not respond currently. Please try again later.","UserIdleMessage":"You are innactive, do you want to continue or you disconnect?","ErrorTitle":"Error","ErrorHeaderTitle":"Application error","ErrorHeaderText":"An application error has occurred while processing the current request. The error was recorded and sent to the site administrator. Provide your administrator ID error below.","ErrorMessage":"Message:","ErrorIdentifier":"Identify:","ErrorDate":"Date:"}
LanguageDictionary.TC_SEARCH = {"OperatorNotEqual":"Not =","OperatorIsDefined":"Is Defined","OperatorIsNotDefined":"Is Not Defined","OperatorContains":"Contains","OperatorDoesNotContain":"Does not contain","OperatorBeginsWith":"Begins with","OperatorDoesNotBeginWith":"Does not begin with","OperatorIsEmpty":"Is Empty","OperatorIsNotEmpty":"Is not empty","CustomFiltersNotComplete":"One or more custom filters are not complete. Examine each custom filter and make sure that the valid search criteria are provided.","NavigateAwayFromSearchWithFilterSet":"You are about to leave this page without performing the search filters custom.","NoGlobalSearchPermissions":"Password","SearchDefinitionLostAlert":"The definition of research will be lost if the primary table is changed. Are you sure you want to change the primary table of the research."}
LanguageDictionary.TC_FORM = {"RequiredFieldsNotSet":"Unable to save the form data. Provide a value for all required fields.","NavigateAwayFromUnsavedForm":"You are about to exit the form without saving it","RefreshFormLosesModifiedData":"The data of the form has changed. The changes you made will be lost when you refresh the form. Do you want to continue?","SaveDataBeforeClose":"The data of the form has changed. Do you want to save them before closing?","DeleteWarning":"The form data will be deleted. Are you sure you want to continue?","DeleteSecondaryWarning":"You are about to delete the form data.","RequiredField":"This is a required field","InvalidFormat":"The format for this field is not valid"}
LanguageDictionary.TC_GLOBALSEARCH = {"CollapseAllLabel":"Reduce everything","ExpandAllLabel":"About expand"}
LanguageDictionary.TC_WIDGETS = {"CallListItem":"Appeal","FaxListItem":"Fax","SmsListItem":"SMS"}
</script>
<script type="text/javascript" src="/Thinclient/Scripts/jquery-1.11.1.min.js"></script>
<script src="/Thinclient/Scripts/jquery-migrate-1.2.1.min.js" type="text/javascript"></script>
<script src="/Thinclient/Scripts/icheck.min.js"></script>
<script src="/Thinclient/Scripts/kendo/kendo.all.min.js"></script>
<script src="/Thinclient/Scripts/kendo/kendo.timezones.min.js"></script>
<script src="/Thinclient/Scripts/kendo/kendo.aspnetmvc.min.js"></script>
<script src="/Thinclient/Scripts/kendo/cultures/kendo.culture.en-US.min.js"></script>
<script>
kendo.culture("en-US");
</script>
</head>
<body class="k-content">
<div id="loadingMsg" class="k-loading-mask" style="width:100%;height:100%">
<span class="k-loading-text">Loading...</span>
<div class="k-loading-image">
<div class="k-loading-color"></div>
</div>
</div>
<input type="hidden" ID="hdPollingFrequency" value="32767"/>
<input type="hidden" ID="hdPrivateComputerTimeout" value= "32767"/>
<input type="hidden" ID="hdPublicComputerTimeout" value="32767"/>
<input type="hidden" ID="hdWarningDisplayDuration" value="0"/>
<input type="hidden" id="hdWindowsAuthentication" value="false"/>
<div class="container">
<div id="content">
</div>
</div>
<div id="loading" style="display: none;">
<h1>
We are processing your request. Please be patient.</h1>
<input type="button" value="Abort" class="abortButton" />
</div>
<script id="taskpadGroupTmpl" type="text/x-jquery-tmpl">
<div class="panelBlock">
<div class="panelTitle"><div class="panelLink"><a class="panelDD-dn" id="${DisplayName}" href="#">${DisplayName}</a></div><div class = "imgPanel">
<a class="imgPanelDD" href="#"> </a></div>
</div>
<div class="panelContent1" id="panelContent1 + ${DisplayName}">
<ul>
{{tmpl(TaskItemCollection) "#taskpadItemTmpl"}}
</ul>
</div>
</div>
</script>
<script id="KendoTestTemplate" type="text/x-kendo-template">
<h2>#= test #</h2>
<ul>
#= kendo.render(kendo.template($("\\#KendoTestLiTemplate").html()), litest) #
</ul>
</script>
<script id="KendoTestLiTemplate" type="text/x-kendo-template">
<li>#= displayName#</li>
</script>
<script id="ErrorTemplate" type="text/x-jquery-tmpl">
<div class="errorMsg k-widget k-notification k-notification-error " data-role="alert" style="display: block; opacity: 1;">
<div class="k-notification-wrap">
<span class="k-icon k-i-note">
error
</span>
${errorMsg}
<span class="k-icon k-i-close">
Hide
</span>
</div>
</div>
</script>
<script id="HelpButtonTemplate" type="text/x-jquery-tmpl">
<button class="k-button k-primary helpButton" id="${id}" onclick="return false;">?</button>
</script>
<script id="IconTemplate" type="text/x-jquery-tmpl">
<span class="k-icon ${icon}"></span>
</script>
<script id="trash" type="text/x-kendo-template">
<li style="background: url(./Images/#=item.ImageId#.#=item.ImageHash#.#=item.ImageFileExtension#) no-repeat;"><a href="#=item.ActionCommand#" #if (item.ShowInNewWindow){# target="_blank" #}# class="#if (!item.ShowInNewWindow){# ajax-links #} if (item.ContentType == 'Email'){# mailto-links #}# linkTaskItem">#=item.DisplayName#</a></li>
<li class="#=GetCssClass(item.ContentType)#"><a href="#=item.ActionCommand#" #if (item.ShowInNewWindow){# target="_blank" #}# class="#if (!item.ShowInNewWindow){# ajax-links #} if (item.ContentType == 'Email'){# mailto-links #}# linkTaskItem">#=item.DisplayName#</a></li>
</script>
<script id="taskpadItemTmpl" type="text/x-jquery-tmpl">
{{if ImageId}}
<li style="background: url(./Images/${ImageId}.${ImageHash}.${ImageFileExtension}) no-repeat;"><a href="${ActionCommand}" {{if ShowInNewWindow}} target="_blank" {{/if}} class="{{if !ShowInNewWindow}} ajax-links {{/if}} {{if (ContentType == 'Email')}} mailto-links {{/if}} linkTaskItem">${DisplayName}</a></li>
{{else}}
<li class="${GetCssClass(ContentType)}"><a href="${ActionCommand}" {{if ShowInNewWindow}} target="_blank" {{/if}} class="{{if !ShowInNewWindow}} ajax-links {{/if}} {{if (ContentType == 'Email')}} mailto-links {{/if}} linkTaskItem">${DisplayName}</a></li>
{{/if}}
</script>
<script id="buttonBarButtonTmpl" type="text/x-jquery-tmpl">
<button value="submit" class="submitBtn k-button k-primary" data-actionCommand="${Action}" data-Disabled="${Disabled}" data-Visible="${Visible}" data-Name="${Name}">
<span>${DisplayName}</span>
</button>
</script>
<script src="/Thinclient/Scripts/jquery.filedownload.150420151637.js" type="text/javascript"></script>
<script src="/Thinclient/Scripts/jquery.tmpl.min.150420151637.js" type="text/javascript"></script>
<script src="/Thinclient/Scripts/pubsub.150420151641.js" type="text/javascript"></script>
<script src="/Thinclient/Scripts/jquery.form.150420151637.js" type="text/javascript"></script>
<script src="/Thinclient/Scripts/bootstrap.min.150420151637.js" type="text/javascript"></script>
<script src="/Thinclient/Scripts/sameheight.min.150420151641.js" type="text/javascript"></script>
<script src="/Thinclient/Scripts/Core.141120161617.js" type="text/javascript"></script>
<script src="/Thinclient/Scripts/PivotalThinClient.150420151641.js" type="text/javascript"></script>
</body>
</html>
<script>
//$( window ).load(
//$(".k-state-default").hover(function () {
// $(this).toggleClass("k-state-hover");
//})
//);
</script>
这不是您在Beautiful Soup中想要的东西吗?
答案 1 :(得分:0)
该页面是使用javascript创建的。 因此,仅使用request / bs4
不能获得页面源如何隐瞒:使用HeadlessChrome创建由javascript创建的页面源
答案 2 :(得分:0)
它不是动态页面(Ajax),您不能使用bs4
,如果您不喜欢硒元素,可以在浏览器弹出窗口中添加--headless
选项以将其隐藏。这里的例子
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.add_argument('--headless')
#options.add_argument('--disable-gpu') # maybe needed if running on Windows.
driver = webdriver.Chrome(chrome_options=options)
print("Loading Page...")
driver.get('https://onlineservices.ocswssw.org/Thinclient/Public/PR/EN/')
# wait max 20 second until ajax content rendered
print("Wait Ajax finished...")
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID , 'MainForm')))
html = driver.execute_script("return document.documentElement.outerHTML")
Soup = BeautifulSoup(html, 'html.parser')
with open('ocswssw.html', 'w') as f:
sourceCode = Soup.prettify().encode('utf-8')
f.write(sourceCode)
print(sourceCode)
driver.quit()