我正在使用python(特别是jupyter笔记本)的网络抓取工具,它可以删除一些房地产页面并保存价格,地址等数据。
我选择的其中一个页面工作正常,但是当我尝试抓取此页面时:sreality.cz(对不起,页面是捷克语,但实际内容现在并不重要)使用reguests .get()我得到了这个结果:
<!doctype html>
<html lang="{{ html.lang }}" ng-app="sreality" ng-controller="MainCtrl">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui">
<!--- Nastaveni meta pres JS a ne pres Angular, aby byla nastavena default hodnota pro agenty co nezvladaji PhantomJS --->
<title ng:bind-template="{{metaSeo.title}}">Sreality.cz • reality a nemovitosti z celé ČR</title>
<meta name="description" content="Největší nabídka nemovitostí v ČR. Nabízíme byty, domy, novostavby, nebytové prostory, pozemky a další reality k prodeji i pronájmu. Sreality.cz">
<meta property="og:title" content="Sreality.cz • reality a nemovitosti z celé ČR">
<meta property="og:type" content="website">
<meta property="og:image" content="https://www.sreality.cz/img/sreality-logo-og.png">
<meta property="og:description" content="Největší nabídka nemovitostí v ČR. Nabízíme byty, domy, novostavby, nebytové prostory, pozemky a další reality k prodeji i pronájmu. Sreality.cz">
<meta property="og:url" content="https://www.sreality.cz/">
<meta ng-if="metaStatus.value" name="szn:status" content="{{metaStatus.value}}">
<meta http-equiv="imagetoolbar" content="no">
<link rel="icon" sizes="16x16 32x32 48x48 64x64" href="/img/icons/favicon.ico">
<link rel="apple-touch-icon" sizes="57x57" href="/img/icons/apple-touch-icon-57x57.png?3">
<link rel="apple-touch-icon" sizes="60x60" href="/img/icons/apple-touch-icon-60x60.png?3">
<link rel="apple-touch-icon" sizes="72x72" href="/img/icons/apple-touch-icon-72x72.png?3">
<link rel="apple-touch-icon" sizes="76x76" href="/img/icons/apple-touch-icon-76x76.png?3">
<link rel="apple-touch-icon" sizes="114x114" href="/img/icons/apple-touch-icon-114x114.png?3">
<link rel="apple-touch-icon" sizes="120x120" href="/img/icons/apple-touch-icon-120x120.png?3">
<link rel="apple-touch-icon" sizes="144x144" href="/img/icons/apple-touch-icon-144x144.png?3">
<link rel="apple-touch-icon" sizes="152x152" href="/img/icons/apple-touch-icon-152x152.png?3">
<link rel="apple-touch-icon" sizes="180x180" href="/img/icons/apple-touch-icon-180x180.png?3">
<link rel="icon" type="image/png" sizes="192x192" href="/img/icons/android-chrome-192x192.png">
<link rel="icon" type="image/png" sizes="32x32" href="/img/icons/favicon-32x32.png">
<link rel="icon" type="image/png" sizes="96x96" href="/img/icons/favicon-96x96.png">
<link rel="icon" type="image/png" sizes="16x16" href="/img/icons/favicon-16x16.png">
<link rel="manifest" href="/img/icons/android-chrome-manifest.json">
<meta name="msapplication-TileColor" content="#2b5797">
<meta name="msapplication-TileImage" content="/img/icons/ms-icon-144x144.png">
<meta name="msapplication-config" content="/img/icons/browserconfig.xml" />
<link rel="alternate" type="application/rss+xml" ng-href="{{ rss.url }}" ng-if="rss.url">
<link ng-repeat="lang in metaSeo.languages" rel="alternate" hreflang="{{lang.code}}" ng-href="{{lang.url}}">
<link rel="stylesheet" href="/css/all.css?2e96626">
<!-- Begin Inspectlet Embed Code -->
<script type="text/javascript" id="inspectletjs">
window.__insp = window.__insp || [];
__insp.push(['wid', 821249485]);
__insp.push(["virtualPage"]);
(function() {
function ldinsp(){if(typeof window.__inspld != "undefined") return; window.__inspld = 1; var insp = document.createElement('script'); insp.type = 'text/javascript'; insp.async = true; insp.id = "inspsync"; insp.src = ('https:' == document.location.protocol ? 'https' : 'http') + '://cdn.inspectlet.com/inspectlet.js'; var x = document.getElementsByTagName('script')[0]; x.parentNode.insertBefore(insp, x); };
setTimeout(ldinsp, 500); document.readyState != "complete" ? (window.attachEvent ? window.attachEvent('onload', ldinsp) : window.addEventListener('load', ldinsp, false)) : ldinsp();
})();
</script>
<!-- End Inspectlet Embed Code -->
<!--[if lte IE 8]>
<script>
document.createElement('popover');
document.createElement('mortgage');
document.createElement('vendor');
document.createElement('hp-signpost');
document.createElement('category-switcher');
document.createElement('feedback');
document.createElement('bottom');
document.createElement('panorama');
document.createElement('panorama-prev');
document.createElement('sphere-viewer');
document.createElement('sphere-viewer-prev');
document.createElement('save-filter');
</script>
<![endif]-->
<!-- Statistiky -->
<script src="https://h.imedia.cz/js/dot-small.js" type="text/javascript"></script>
<script type="text/javascript">
(function() {
try {
// Při přesměrování na hashbang URL (IE8-9) ztrácíme referrer,
// který je potřeba pro správné počítání statistik.
if (window.sessionStorage) { // někdo může mít DOM storage zakázaný
var l = document.createElement('a');
l.href = document.referrer;
var referrerHostname = l.hostname;
if (window.location.hostname != referrerHostname) {
window.sessionStorage.setItem('referrer', l.href);
}
}
// Starý android (< 4.0) v kombinaci s angularem špatně pracuje s hashem v URL.
// Považuje ho za součást query případně path.
// Na takových zařízech se budeme tvářit, že žádný hash nebyl.
if (parseInt((/android (\d+)/.exec(window.navigator.userAgent.toLowerCase()) || [])[1], 10) < 4) {
var hrefWithoutHashbang = window.location.href.replace('/#!', '');
var hashIndex = hrefWithoutHashbang.indexOf('#');
if (hashIndex != -1) {
window.location.replace(hrefWithoutHashbang.substring(0, hashIndex));
}
}
} catch (e) {}
})();
</script>
<!-- API mapy.cz -->
<script type="text/javascript" src="https://api4.mapy.cz/loader.js"></script>
<script type="text/javascript">Loader.load(null, {poi: true, pano: true})</script>
<!-- Login reklama -->
<script src="https://i.imedia.cz/js/im3.js" type="text/javascript"></script>
<script src="https://1.im.cz/software/promo/promo-sbrowser.js"></script>
<!-- Rozkopírování SID cookie -->
<script src="https://h.imedia.cz/js/sid.js"></script>
<!-- Login -->
<script src="https://login.szn.cz/js/api/login.js"></script>
<script>
login.cfg({
serviceId: "sreality"
});
</script>
<!-- KONFIGURACE -->
<script src="/js/conf/config.js?2e96626"></script>
<script src="/js/advert.js"></script>
<script src="/js/all.js?2e96626"></script>
<script type="text/javascript">
if (window.DOT) {
var dotCfg = {
service: 'sreality'
};
if (window.SrealityABTest && window.SrealityABTest.getVariant()) {
dotCfg.abtest = window.SrealityABTest.getVariant();
}
DOT.cfg(dotCfg);
}
</script>
<noscript>
<meta http-equiv="refresh" content="0;url=?_escaped_fragment_="/>
</noscript>
<meta name="fragment" content="!" ng-if="metaSeo.showMetaFragment" />
</head>
<!--[if IE 8]> <body class="ie8"> <![endif]-->
<!--[if IE 9]> <body class="notie8 ie9"> <![endif]-->
<!--[if gt IE 9]><!-->
<body class="notie8 notie9 lang-{{html.lang}}">
<!--<![endif]-->
<div loading-line></div>
<div page-layout>
<div ng-view></div>
</div>
</body>
</html>
虽然它与我在Chrome的开发者工具中查看页面时看到的不同 - 代码的一部分在这里(整个代码不适合这里,而且uploadtext由于某种原因不起作用) :
<!DOCTYPE html>
<html lang="cs" ng-app="sreality" ng-controller="MainCtrl" class="ng-scope"><head><style type="text/css">@charset "UTF-8";[ng\:cloak],[ng-cloak],[data-ng-cloak],[x-ng-cloak],.ng-cloak,.x-ng-cloak,.ng-hide{display:none !important;}ng\:form{display:block;}.ng-animate-block-transitions{transition:0s all!important;-webkit-transition:0s all!important;}.ng-hide-add-active,.ng-hide-remove{display:block!important;}</style>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui">
<!--- Nastaveni meta pres JS a ne pres Angular, aby byla nastavena default hodnota pro agenty co nezvladaji PhantomJS --->
<title ng:bind-template="Byty na prodej Brno-město, posledních 30 dní • Sreality.cz" class="ng-binding">Byty na prodej Brno-město, posledních 30 dní • Sreality.cz</title>
<meta name="description" content="284 realit v nabídce prodej bytů Brno-město s požadavky: posledních 30 dní. Vyberte si novou nemovitost na sreality.cz s hledáním na mapě a velkými náhledy fotografií nabízených bytů.">
<meta property="og:title" content="Byty na prodej Brno-město, posledních 30 dní">
<meta property="og:type" content="website">
<meta property="og:image" content="https://www.sreality.cz/img/sreality-logo-og.png">
<meta property="og:description" content="284 realit v nabídce prodej bytů Brno-město s požadavky: posledních 30 dní. Vyberte si novou nemovitost na sreality.cz s hledáním na mapě a velkými náhledy fotografií nabízených bytů.">
<meta property="og:url" content="https://www.sreality.cz/hledani/prodej/byty/brno?stari=mesic">
<!-- ngIf: metaStatus.value --><meta ng-if="metaStatus.value" name="szn:status" content="200" class="ng-scope"><!-- end ngIf: metaStatus.value -->
<meta http-equiv="imagetoolbar" content="no">
<link rel="icon" sizes="16x16 32x32 48x48 64x64" href="/img/icons/favicon.ico">
<link rel="apple-touch-icon" sizes="57x57" href="/img/icons/apple-touch-icon-57x57.png?3">
<link rel="apple-touch-icon" sizes="60x60" href="/img/icons/apple-touch-icon-60x60.png?3">
<link rel="apple-touch-icon" sizes="72x72" href="/img/icons/apple-touch-icon-72x72.png?3">
<link rel="apple-touch-icon" sizes="76x76" href="/img/icons/apple-touch-icon-76x76.png?3">
<link rel="apple-touch-icon" sizes="114x114" href="/img/icons/apple-touch-icon-114x114.png?3">
<link rel="apple-touch-icon" sizes="120x120" href="/img/icons/apple-touch-icon-120x120.png?3">
<link rel="apple-touch-icon" sizes="144x144" href="/img/icons/apple-touch-icon-144x144.png?3">
<link rel="apple-touch-icon" sizes="152x152" href="/img/icons/apple-touch-icon-152x152.png?3">
<link rel="apple-touch-icon" sizes="180x180" href="/img/icons/apple-touch-icon-180x180.png?3">
<link rel="icon" type="image/png" sizes="192x192" href="/img/icons/android-chrome-192x192.png">
<link rel="icon" type="image/png" sizes="32x32" href="/img/icons/favicon-32x32.png">
<link rel="icon" type="image/png" sizes="96x96" href="/img/icons/favicon-96x96.png">
<link rel="icon" type="image/png" sizes="16x16" href="/img/icons/favicon-16x16.png">
<link rel="manifest" href="/img/icons/android-chrome-manifest.json">
<meta name="msapplication-TileColor" content="#2b5797">
<meta name="msapplication-TileImage" content="/img/icons/ms-icon-144x144.png">
<meta name="msapplication-config" content="/img/icons/browserconfig.xml">
<!-- ngIf: rss.url --><link rel="alternate" type="application/rss+xml" ng-href="/api/cs/v2/estates/rss?category_main_cb=1&locality_district_id=72&suggested_regionId=-1&suggested_districtId=-1&estate_age=31&locality_region_id=14&category_type_cb=1" ng-if="rss.url" class="ng-scope" href="/api/cs/v2/estates/rss?category_main_cb=1&locality_district_id=72&suggested_regionId=-1&suggested_districtId=-1&estate_age=31&locality_region_id=14&category_type_cb=1"><!-- end ngIf: rss.url -->
我可以从第一个html代码中看到,request.get下载该页面运行的某些脚本可能导致html不同。
我已经尝试过使用urllib,但结果html doc仍然是一样的。
有没有办法下载我在Chromes的开发者工具中打开页面时看到的html,以便我可以抓住它?
答案 0 :(得分:1)
如果最终来自该页面的数据,您可以使用selenium与BeautifulSoup结合使用。它为您提供了公寓的所有链接。
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.sreality.cz/hledani/prodej/byty/brno?stari=mesic")
soup = BeautifulSoup(driver.page_source,"html.parser")
driver.quit()
for title in soup.select(".text-wrap"):
num = "https://www.sreality.cz" + title.select_one(".title").get('href')
print(num)