Requests.get shows different HTML than Chrome's Developer Tools

Time: 2017-09-01 08:18:22

Tags: python html web-scraping python-requests jupyter-notebook

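I am working on a web scraping tool in Python (specifically in a Jupyter notebook) that scrapes a few real estate pages and saves data such as price, address, etc.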

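One of the pages I picked works fine, but when I try to scrape this page: sreality.cz (sorry, the page is in Czech, but the actual content is not important here) using requests.get(), I get this result: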

<!doctype html>
<html lang="{{ html.lang }}" ng-app="sreality" ng-controller="MainCtrl">
<head>
	<meta charset="utf-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui">

	<!--- Nastaveni meta pres JS a ne pres Angular, aby byla nastavena default hodnota pro agenty co nezvladaji PhantomJS --->
	<title ng:bind-template="{{metaSeo.title}}">Sreality.cz • reality a nemovitosti z celé ČR</title>
	<meta name="description" content="Největší nabídka nemovitostí v ČR. Nabízíme byty, domy, novostavby, nebytové prostory, pozemky a další reality k prodeji i pronájmu. Sreality.cz">
	<meta property="og:title"       content="Sreality.cz • reality a nemovitosti z celé ČR">
	<meta property="og:type"        content="website">
	<meta property="og:image"       content="https://www.sreality.cz/img/sreality-logo-og.png">
	<meta property="og:description" content="Největší nabídka nemovitostí v ČR. Nabízíme byty, domy, novostavby, nebytové prostory, pozemky a další reality k prodeji i pronájmu. Sreality.cz">
	<meta property="og:url"         content="https://www.sreality.cz/">

	<meta ng-if="metaStatus.value" name="szn:status" content="{{metaStatus.value}}">

	<meta http-equiv="imagetoolbar" content="no">

	<link rel="icon" sizes="16x16 32x32 48x48 64x64" href="/img/icons/favicon.ico">
	<link rel="apple-touch-icon" sizes="57x57" href="/img/icons/apple-touch-icon-57x57.png?3">
	<link rel="apple-touch-icon" sizes="60x60" href="/img/icons/apple-touch-icon-60x60.png?3">
	<link rel="apple-touch-icon" sizes="72x72" href="/img/icons/apple-touch-icon-72x72.png?3">
	<link rel="apple-touch-icon" sizes="76x76" href="/img/icons/apple-touch-icon-76x76.png?3">
	<link rel="apple-touch-icon" sizes="114x114" href="/img/icons/apple-touch-icon-114x114.png?3">
	<link rel="apple-touch-icon" sizes="120x120" href="/img/icons/apple-touch-icon-120x120.png?3">
	<link rel="apple-touch-icon" sizes="144x144" href="/img/icons/apple-touch-icon-144x144.png?3">
	<link rel="apple-touch-icon" sizes="152x152" href="/img/icons/apple-touch-icon-152x152.png?3">
	<link rel="apple-touch-icon" sizes="180x180" href="/img/icons/apple-touch-icon-180x180.png?3">
	<link rel="icon" type="image/png" sizes="192x192"  href="/img/icons/android-chrome-192x192.png">
	<link rel="icon" type="image/png" sizes="32x32" href="/img/icons/favicon-32x32.png">
	<link rel="icon" type="image/png" sizes="96x96" href="/img/icons/favicon-96x96.png">
	<link rel="icon" type="image/png" sizes="16x16" href="/img/icons/favicon-16x16.png">
	<link rel="manifest" href="/img/icons/android-chrome-manifest.json">
	<meta name="msapplication-TileColor" content="#2b5797">
	<meta name="msapplication-TileImage" content="/img/icons/ms-icon-144x144.png">
	<meta name="msapplication-config" content="/img/icons/browserconfig.xml" />

	<link rel="alternate" type="application/rss+xml" ng-href="{{ rss.url }}" ng-if="rss.url">
	<link ng-repeat="lang in metaSeo.languages" rel="alternate" hreflang="{{lang.code}}" ng-href="{{lang.url}}">

	<link rel="stylesheet" href="/css/all.css?2e96626">

	<!-- Begin Inspectlet Embed Code -->
	<script type="text/javascript" id="inspectletjs">
	window.__insp = window.__insp || [];
	__insp.push(['wid', 821249485]);
	__insp.push(["virtualPage"]);
	(function() {
	function ldinsp(){if(typeof window.__inspld != "undefined") return; window.__inspld = 1; var insp = document.createElement('script'); insp.type = 'text/javascript'; insp.async = true; insp.id = "inspsync"; insp.src = ('https:' == document.location.protocol ? 'https' : 'http') + '://cdn.inspectlet.com/inspectlet.js'; var x = document.getElementsByTagName('script')[0]; x.parentNode.insertBefore(insp, x); };
	setTimeout(ldinsp, 500); document.readyState != "complete" ? (window.attachEvent ? window.attachEvent('onload', ldinsp) : window.addEventListener('load', ldinsp, false)) : ldinsp();
	})();
	</script>
	<!-- End Inspectlet Embed Code -->

	<!--[if lte IE 8]>
		<script>
			document.createElement('popover');
			document.createElement('mortgage');
			document.createElement('vendor');
			document.createElement('hp-signpost');
			document.createElement('category-switcher');
			document.createElement('feedback');
			document.createElement('bottom');
			document.createElement('panorama');
			document.createElement('panorama-prev');
			document.createElement('sphere-viewer');
			document.createElement('sphere-viewer-prev');
			document.createElement('save-filter');
		</script>
    <![endif]-->

	<!-- Statistiky -->
	<script src="https://h.imedia.cz/js/dot-small.js" type="text/javascript"></script>
	<script type="text/javascript">
		(function() {
			try {
				// Při přesměrování na hashbang URL (IE8-9) ztrácíme referrer,
				// který je potřeba pro správné počítání statistik.
				if (window.sessionStorage) { // někdo může mít DOM storage zakázaný
					var l = document.createElement('a');
					l.href = document.referrer;
					var referrerHostname = l.hostname;

					if (window.location.hostname != referrerHostname) {
						window.sessionStorage.setItem('referrer', l.href);
					}
				}

				// Starý android (< 4.0) v kombinaci s angularem špatně pracuje s hashem v URL.
				// Považuje ho za součást query případně path.
				// Na takových zařízech se budeme tvářit, že žádný hash nebyl.
				if (parseInt((/android (\d+)/.exec(window.navigator.userAgent.toLowerCase()) || [])[1], 10) < 4) {
					var hrefWithoutHashbang = window.location.href.replace('/#!', '');
					var hashIndex = hrefWithoutHashbang.indexOf('#');
					if (hashIndex != -1) {
						window.location.replace(hrefWithoutHashbang.substring(0, hashIndex));
					}
				}
			} catch (e) {}
		})();
	</script>

	<!-- API mapy.cz -->
	<script type="text/javascript" src="https://api4.mapy.cz/loader.js"></script>
	<script type="text/javascript">Loader.load(null, {poi: true, pano: true})</script>

	<!-- Login reklama -->
	<script src="https://i.imedia.cz/js/im3.js" type="text/javascript"></script>

	<script src="https://1.im.cz/software/promo/promo-sbrowser.js"></script>

	<!-- Rozkopírování SID cookie -->
	<script src="https://h.imedia.cz/js/sid.js"></script>

	<!-- Login -->
	<script src="https://login.szn.cz/js/api/login.js"></script>
	<script>
		login.cfg({
			serviceId: "sreality"
		});
	</script>

	<!-- KONFIGURACE -->
	<script src="/js/conf/config.js?2e96626"></script>

	<script src="/js/advert.js"></script>
	<script src="/js/all.js?2e96626"></script>

	<script type="text/javascript">
		if (window.DOT) {
			var dotCfg = {
				service: 'sreality'
			};
			if (window.SrealityABTest && window.SrealityABTest.getVariant()) {
				dotCfg.abtest = window.SrealityABTest.getVariant();
			}
			DOT.cfg(dotCfg);
		}
	</script>

	<noscript>
		<meta http-equiv="refresh" content="0;url=?_escaped_fragment_="/>
	</noscript>
	<meta name="fragment" content="!" ng-if="metaSeo.showMetaFragment" />

</head>
<!--[if IE 8]>    <body class="ie8"> <![endif]-->
<!--[if IE 9]>    <body class="notie8 ie9"> <![endif]-->
<!--[if gt IE 9]><!-->
<body class="notie8 notie9 lang-{{html.lang}}">
<!--<![endif]-->
	<div loading-line></div>

	<div page-layout>
		<div ng-view></div>
	</div>
</body>
</html>

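That is different from what I see when I inspect the page in Chrome's Developer Tools. Here is part of that code (the whole thing does not fit here, and uploadtext is not working for some reason):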

<!DOCTYPE html>
<html lang="cs" ng-app="sreality" ng-controller="MainCtrl" class="ng-scope"><head><style type="text/css">@charset "UTF-8";[ng\:cloak],[ng-cloak],[data-ng-cloak],[x-ng-cloak],.ng-cloak,.x-ng-cloak,.ng-hide{display:none !important;}ng\:form{display:block;}.ng-animate-block-transitions{transition:0s all!important;-webkit-transition:0s all!important;}.ng-hide-add-active,.ng-hide-remove{display:block!important;}</style>
	<meta charset="utf-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui">

	<!--- Nastaveni meta pres JS a ne pres Angular, aby byla nastavena default hodnota pro agenty co nezvladaji PhantomJS --->
	<title ng:bind-template="Byty na prodej Brno-město, posledních 30 dní • Sreality.cz" class="ng-binding">Byty na prodej Brno-město, posledních 30 dní • Sreality.cz</title>
	<meta name="description" content="284 realit v nabídce prodej bytů Brno-město s požadavky: posledních 30 dní. Vyberte si novou nemovitost na sreality.cz s hledáním na mapě a velkými náhledy fotografií nabízených bytů.">
	<meta property="og:title" content="Byty na prodej Brno-město, posledních 30 dní">
	<meta property="og:type" content="website">
	<meta property="og:image" content="https://www.sreality.cz/img/sreality-logo-og.png">
	<meta property="og:description" content="284 realit v nabídce prodej bytů Brno-město s požadavky: posledních 30 dní. Vyberte si novou nemovitost na sreality.cz s hledáním na mapě a velkými náhledy fotografií nabízených bytů.">
	<meta property="og:url" content="https://www.sreality.cz/hledani/prodej/byty/brno?stari=mesic">

	<!-- ngIf: metaStatus.value --><meta ng-if="metaStatus.value" name="szn:status" content="200" class="ng-scope"><!-- end ngIf: metaStatus.value -->

	<meta http-equiv="imagetoolbar" content="no">

	<link rel="icon" sizes="16x16 32x32 48x48 64x64" href="/img/icons/favicon.ico">
	<link rel="apple-touch-icon" sizes="57x57" href="/img/icons/apple-touch-icon-57x57.png?3">
	<link rel="apple-touch-icon" sizes="60x60" href="/img/icons/apple-touch-icon-60x60.png?3">
	<link rel="apple-touch-icon" sizes="72x72" href="/img/icons/apple-touch-icon-72x72.png?3">
	<link rel="apple-touch-icon" sizes="76x76" href="/img/icons/apple-touch-icon-76x76.png?3">
	<link rel="apple-touch-icon" sizes="114x114" href="/img/icons/apple-touch-icon-114x114.png?3">
	<link rel="apple-touch-icon" sizes="120x120" href="/img/icons/apple-touch-icon-120x120.png?3">
	<link rel="apple-touch-icon" sizes="144x144" href="/img/icons/apple-touch-icon-144x144.png?3">
	<link rel="apple-touch-icon" sizes="152x152" href="/img/icons/apple-touch-icon-152x152.png?3">
	<link rel="apple-touch-icon" sizes="180x180" href="/img/icons/apple-touch-icon-180x180.png?3">
	<link rel="icon" type="image/png" sizes="192x192" href="/img/icons/android-chrome-192x192.png">
	<link rel="icon" type="image/png" sizes="32x32" href="/img/icons/favicon-32x32.png">
	<link rel="icon" type="image/png" sizes="96x96" href="/img/icons/favicon-96x96.png">
	<link rel="icon" type="image/png" sizes="16x16" href="/img/icons/favicon-16x16.png">
	<link rel="manifest" href="/img/icons/android-chrome-manifest.json">
	<meta name="msapplication-TileColor" content="#2b5797">
	<meta name="msapplication-TileImage" content="/img/icons/ms-icon-144x144.png">
	<meta name="msapplication-config" content="/img/icons/browserconfig.xml">
<!-- ngIf: rss.url --><link rel="alternate" type="application/rss+xml" ng-href="/api/cs/v2/estates/rss?category_main_cb=1&amp;locality_district_id=72&amp;suggested_regionId=-1&amp;suggested_districtId=-1&amp;estate_age=31&amp;locality_region_id=14&amp;category_type_cb=1" ng-if="rss.url" class="ng-scope" href="/api/cs/v2/estates/rss?category_main_cb=1&amp;locality_district_id=72&amp;suggested_regionId=-1&amp;suggested_districtId=-1&amp;estate_age=31&amp;locality_region_id=14&amp;category_type_cb=1"><!-- end ngIf: rss.url -->

I can see from the first HTML that the page downloaded by requests.get() runs some scripts, which is probably why the HTML differs.
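As a quick check of that guess, here is a minimal sketch that looks for AngularJS-style `{{ ... }}` placeholders left in a string of HTML. Finding them suggests that nothing has executed the page's JavaScript yet. The helper name `find_placeholders` and the two sample strings are made up for illustration; the samples are shortened from the two HTML dumps above.

```python
import re

# Matches simple AngularJS interpolation markers such as {{ html.lang }}.
PLACEHOLDER = re.compile(r"\{\{\s*[\w.]+\s*\}\}")


def find_placeholders(html):
    """Return every unrendered {{ ... }} marker found in the HTML text."""
    return PLACEHOLDER.findall(html)


# Shortened from the requests.get() output (template not yet filled in).
raw_html = '<html lang="{{ html.lang }}"><title ng:bind-template="{{metaSeo.title}}">'
# Shortened from the Developer Tools view (after the scripts have run).
rendered_html = '<html lang="cs" ng-app="sreality" class="ng-scope">'

print(find_placeholders(raw_html))
print(find_placeholders(rendered_html))
```

If the first call reports placeholders and the second reports none, the difference most likely comes from client-side rendering rather than from the HTTP library being used.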

I have also tried urllib, but the resulting HTML document is still the same.

Is there a way to download the HTML I see when I open the page in Chrome's Developer Tools, so that I can scrape it?

1 answer:

Answer 0: (score: 1)

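If what you ultimately want is the data from that page, you can use selenium in combination with BeautifulSoup. Selenium drives a real browser, so the page's JavaScript runs before the HTML is read. The code below gives you all the links to the apartments.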

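from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
try:
    driver.get("https://www.sreality.cz/hledani/prodej/byty/brno?stari=mesic")
    # Wait up to 10 s for Angular to render at least one listing title.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".text-wrap .title"))
    )
    soup = BeautifulSoup(driver.page_source, "html.parser")
finally:
    driver.quit()  # always close the browser, even if the wait times out

for item in soup.select(".text-wrap"):
    title = item.select_one(".title")
    # Skip entries without a link instead of failing on a missing element.
    if title is not None and title.get("href"):
        print("https://www.sreality.cz" + title.get("href"))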
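The link-extraction loop above can also be pulled into a small function so it can be tried without starting a browser. This is only a sketch: `extract_listing_links`, `BASE_URL` and the sample fragment (including its `/detail/prodej/byt/1` path) are hypothetical names and data, and the `.text-wrap` / `.title` selectors are simply the ones used in the answer, which may change if the site's markup changes.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.sreality.cz"


def extract_listing_links(html, base_url=BASE_URL):
    """Collect absolute listing URLs from already-rendered HTML."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.select(".text-wrap .title"):
        href = anchor.get("href")
        if href:  # skip title elements that carry no link
            links.append(urljoin(base_url, href))
    return links


# Hypothetical fragment standing in for driver.page_source.
sample = """
<div class="text-wrap"><a class="title" href="/detail/prodej/byt/1">Byt 2+kk</a></div>
<div class="text-wrap"><span class="title">no link</span></div>
"""
print(extract_listing_links(sample))
```

Using `urljoin` instead of string concatenation keeps the result correct whether the site emits relative or absolute `href` values. Note that this version collects every `.title` inside a `.text-wrap`, not only the first one.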