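I am very, very new to Python and want to try it out on some real applications.
I'm trying to put together a basic web price scraper using the requests library. I picked this page: https://www.usstoragecenters.com/storage-units/fl/north-miami-beach/15555-w-dixie-hwy
This is the basic structure I'm using: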
import requests
page = requests.get("my url from above")
page
page.content
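But for some reason, the HTML printed via .content or .text looks very wrong. Instead of the HTML structure, I see a huge number of carriage returns. Data is definitely missing.
I tried parsing it with Beautiful Soup (html.parser, html5lib, etc.), and that strips out even more data.
Is the page simply coded in a way that prevents scraping, or am I doing something wrong?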
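Answer 0 (score: 1)
Problem:
The problem is that the page relies on embedded JavaScript, so data appears to be missing from the HTML you get back. The [requests_html] library by kennethreitz is a very good library designed for requesting and parsing HTML.
Sample code: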
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.usstoragecenters.com/storage-units/fl/north-miami-beach/15555-w-dixie-hwy')
# Print the response one line at a time (each line is a bytes object)
for line in r.iter_lines():
    print(line)
Sample output
Because of the comment size limit I can't post the full HTML; here is a fragment of the HTML printed by the code above
b'<!doctype html>'
b'<html>'
b'<head>'
b'<meta charset="utf-8">'
b'<title>Self Storage Units at 15555 West Dixie Highway, North Miami Beach, FL 33162 | US Storage Centers</title>'
b'<base href="/">'
b'<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />'
b'<meta name="description" content="Brand New Facility Grand Opening! Special 50% Off Self Storage. Friendly Service. Reserve Online for Free. No Credit Card Required." />'
b'<meta property="og:type" content="website" />'
b'<meta property="og:locale" content="en_US" />'
b'<meta property="og:site_name" content="US Storage Centers" />'
b'<meta property="og:title" content="Self Storage North Miami Beach" />'
b'<meta property="og:url" content="https://www.usstoragecenters.com/storage-units/fl/north-miami-beach/15555-w-dixie-hwy" />'
b'<meta property="og:description" content="Brand New Facility Grand Opening! Special 50% Off Self Storage. Friendly Service. Reserve Online for Free. No Credit Card Required." />'
b'<meta property="og:image" content="https://www.usstoragecenters.com/www/images/ussc_facility_photos/168/2017-06-15_00-37-08_Self%20Storage%20Building%20Exterior%20Front%20-%20North%20Miami%20Beach%20West%20Dixie%20IMG_5237%208.jpg" />'
b'<script type="application/ld+json">'
b' {'
b' "@context": "http://schema.org",'
b' "@type": "WebPage"'
b' ,"breadcrumb": {'
b' "@context": "http://schema.org",'
b' "@type": "BreadcrumbList",'
b' "itemListElement": [{'
b' "@type": "ListItem",'
b' "name": "US Storage Centers",'
b' "url": "https://www.usstoragecenters.com/",'
b' "position": 0'
b' }, {'
b' "@type": "ListItem",'
b' "name": "Storage Units",'
b' "url": "https://www.usstoragecenters.com/storage-units",'
b' "position": 1'
b' }, {'
b' "@type": "ListItem",'
b' "name": "FL",'
b' "url": "https://www.usstoragecenters.com/storage-units/fl",'
b' "position": 2'
b' }, {'
**...... truncated .....**
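Answer 1 (score: 0)
Call print(page.content)
Printing renders the carriage returns and other whitespace that should be there (newlines, tabs, etc.) instead of showing them as escape sequences. In Python 3, page.content is a bytes object, so print(page.text) is the call that shows them rendered.
Test: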
s = """
Hey
\r\r\r\r\r Look
\t\t\t\t\t\t Here"""
print(s)
Output:
Hey
Look
Here
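To make the difference concrete, here is a minimal sketch comparing the bytes form with the decoded text form. The `raw` value is a made-up stand-in for `page.content`; no request is made, and the real page will of course contain different markup.

```python
# Made-up stand-in for page.content: raw bytes as a server might return them.
raw = b"<html>\r\n\t<body>Hey</body>\r\n</html>"

# Converting bytes to str this way yields their repr, which is what print()
# shows for a bytes object: escapes such as \r\n stay visible as text.
bytes_view = str(raw)

# Decoding produces a str (analogous to page.text); printing it renders
# the carriage returns, newlines and tabs as real whitespace.
text_view = raw.decode("utf-8")

print(bytes_view)
print(text_view)
```

If the escape sequences show up literally in your output, you are most likely looking at the bytes form (or at the interactive interpreter's repr) rather than at missing data.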
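Answer 2 (score: 0)
What you see in your browser's developer tools does not correspond to the HTML returned by the web server. If you view the page source in the browser, you will see that all of the page content is generated by JavaScript from JSON contained in a <script> tag.
This makes your life easier, because you don't have to worry about parsing the HTML and can just extract the data from the JSON: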
import json
from bs4 import BeautifulSoup
...
soup = BeautifulSoup(page.text, 'html.parser')
# Find the `script` tag with no `src` and 'window.jsonData' in its text
# (the `text and ...` guard skips tags whose string is None)
script = soup.find('script', src=None, string=lambda text: text and 'window.jsonData' in text).get_text()
# The JSON is part of script, so just remove the extra stuff
script = script.strip().replace('window.jsonData = ', '').rstrip(';')
# Now parse it
data = json.loads(script)
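For experimenting offline, the same idea can be sketched with only the standard library. This is a variant, not the code above: it uses a regular expression plus `json.JSONDecoder.raw_decode` instead of Beautiful Soup. The `sample_html` string, its keys (`facility`, `units`, `price`), and the helper name `extract_json_data` are all hypothetical; the real page's JSON layout is not shown in this thread.

```python
import json
import re

# Hypothetical stand-in for page.text; the real page's JSON keys are unknown here.
sample_html = """
<html><body>
<script src="/app.js"></script>
<script>
  window.jsonData = {"facility": {"name": "Example", "units": [{"size": "5x5", "price": 49.0}]}};
</script>
</body></html>
"""


def extract_json_data(html, marker="window.jsonData"):
    """Return the value assigned to `marker` in an inline script, or None."""
    # Locate "window.jsonData = " and stop right before the JSON value starts.
    match = re.search(re.escape(marker) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows it (the trailing ';' and closing tags).
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data


data = extract_json_data(sample_html)
print(data["facility"]["units"][0]["price"])
```

Using `raw_decode` avoids having to strip the trailing semicolon by hand, at the cost of trusting that the assignment is followed directly by valid JSON.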