Python 3.5: BeautifulSoup cannot read the page

Time: 2017-07-16 00:23:48

Tags: beautifulsoup python-3.5

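When I go through the following steps:

The steps above take me to the following URL: http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=228792

There you can see the data.

However, if I use the following code: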

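from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup       # BeautifulSoup 4 package

url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
# Request the property page directly, without visiting the search page first
soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
print(soup)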

Instead of the data, I get this error page:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
<title>Travis Property Search</title>
<style type="text/css">
      body { text-align: center; padding: 150px; }
      h1 { font-size: 50px; }
      body { font: 20px Helvetica, sans-serif; color: #333; }
      #article { display: block; text-align: left; width: 650px; margin: 0 auto; }
      a { color: #dc8100; text-decoration: none; }
      a:hover { color: #333; text-decoration: none; }
    </style>
</head>
<body>
<div id="article">
<h1>Please try again</h1>
<div>
<p>Sorry for the inconvenience but your session has either timed out or the server is busy handling other requests. You may visit us on the the following website for information, otherwise please retry your search again shortly:<br /><br />
<a href="http://www.traviscad.org/">Travis Central Appraisal District Website</a> </p>
<p><b><a href="http://propaccess.traviscad.org/clientdb/?cid=1">Click here to reload the property search to try again</a></b></p>
</div>
</div>
</body>
</html>

I have tried other approaches that involve importing cookies, but I still cannot read the data with Python.

1 Answer:

Answer 0: (score: 1)

Try something like this:

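import requests
from bs4 import BeautifulSoup

# A Session object keeps cookies between requests
s = requests.Session()
# First visit the search entry page so the server sets the session cookie
r = s.get('http://propaccess.traviscad.org/clientdb/?cid=1')
# Then request the property page; the session cookie is sent automatically
r2 = s.get('http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669')

soup = BeautifulSoup(r2.text, 'html.parser')
print(soup.prettify())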

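The first request fetches the page that establishes the session, and the requests session object stores the session data (cookies). The next request sends the session cookie along, so the server returns the property page rather than the "Please try again" page. You should then be able to pass that text to BeautifulSoup for parsing.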
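
To see why the order of requests matters, here is a minimal offline sketch that uses only the standard library. It does not talk to the real site: ToySite, its /start and /property paths, and the sid=abc cookie are all hypothetical stand-ins, and the real server's session logic is assumed (not known) to behave along similar lines. The cookie-aware opener plays the role that requests.Session() plays in the answer above.

```python
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class ToySite(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a site that requires a session cookie."""

    def do_GET(self):
        has_session = "sid=abc" in self.headers.get("Cookie", "")
        if self.path == "/start":
            body = b"session started"
        elif has_session:
            body = b"property data"
        else:
            body = b"Please try again"
        self.send_response(200)
        if self.path == "/start":
            # The entry page is the only place the cookie is handed out.
            self.send_header("Set-Cookie", "sid=abc; Path=/")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(opener, url):
    """Return the decoded response body for url using the given opener."""
    with opener.open(url) as resp:
        return resp.read().decode()


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), ToySite)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    no_proxy = urllib.request.ProxyHandler({})  # never route localhost via a proxy
    try:
        # No cookie handling: the data page is requested "cold".
        plain = urllib.request.build_opener(no_proxy)
        without_session = fetch(plain, base + "/property")

        # Cookie-aware opener: visit the entry page first, then the data page.
        jar = CookieJar()
        with_cookies = urllib.request.build_opener(
            no_proxy, urllib.request.HTTPCookieProcessor(jar))
        fetch(with_cookies, base + "/start")
        with_session = fetch(with_cookies, base + "/property")
    finally:
        server.shutdown()
        server.server_close()
    return without_session, with_session


if __name__ == "__main__":
    print(run_demo())
```

The first value returned is what the toy server sends to a client that skipped the entry page, and the second is what it sends once the cookie from /start is replayed. If the real site still returns the error page after the two-step approach, it may be checking something other than cookies, which this sketch does not cover.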