Question

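When I go through the following steps:

Open this link in a browser: http://propaccess.traviscad.org/clientdb/?cid=1
Type Jim into the Property Search box and click Search
Click the View Details column of the first result

The steps above take me to the following URL: http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=228792

where I can view the data.

However, if I use the following code: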

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
soup = BeautifulSoup(urlopen(url).read())
print soup

I get this error page instead:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
<title>Travis Property Search</title>
<style type="text/css">
      body { text-align: center; padding: 150px; }
      h1 { font-size: 50px; }
      body { font: 20px Helvetica, sans-serif; color: #333; }
      #article { display: block; text-align: left; width: 650px; margin: 0 auto; }
      a { color: #dc8100; text-decoration: none; }
      a:hover { color: #333; text-decoration: none; }
    </style>
</head>
<body>
<div id="article">
<h1>Please try again</h1>
<div>
<p>Sorry for the inconvenience but your session has either timed out or the server is busy handling other requests. You may visit us on the the following website for information, otherwise please retry your search again shortly:<br /><br />
<a href="http://www.traviscad.org/">Travis Central Appraisal District Website</a> </p>
<p><b><a href="http://propaccess.traviscad.org/clientdb/?cid=1">Click here to reload the property search to try again</a></b></p>
</div>
</div>
</body>
</html>

I have tried other approaches that import cookies, but I still cannot read the data with Python.

Answer 1

Try something like this:

import requests
from bs4 import BeautifulSoup

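# A session object keeps cookies between requests
s = requests.session()
# Load the search page first so the server can set its session cookie
r = s.get('http://propaccess.traviscad.org/clientdb/?cid=1')
# Request the property page through the same session so the cookie is sent along
r2 = s.get('http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669')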

soup = BeautifulSoup(r2.text, 'html.parser')
print(soup.prettify())

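This first fetches the page that establishes the session, and requests.session stores the session data. On the next request it sends the session cookie along and retrieves the page text. You should then be able to pass that text to BeautifulSoup for parsing.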
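Since the question started from urlopen, the same idea can probably be expressed with only the Python 3 standard library. The following is a minimal sketch, not a tested recipe for this site: it assumes the server only needs the cookie set by the search page, and the names build_cookie_opener, fetch_with_session, and SESSION_ERROR_MARKER are hypothetical helpers introduced here for illustration.

```python
import urllib.request
from http.cookiejar import CookieJar

# Assumption: the heading text of the error page shown in the question
# is a good enough signal that no valid session was sent.
SESSION_ERROR_MARKER = "Please try again"


def build_cookie_opener():
    """Return an opener that stores cookies and resends them on later requests."""
    jar = CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def fetch_with_session(opener, landing_url, target_url, timeout=30):
    """Visit landing_url first (to pick up session cookies), then fetch target_url."""
    # First request: the response body is not needed, only the cookies it sets.
    with opener.open(landing_url, timeout=timeout) as resp:
        resp.read()

    # Second request: the opener sends the stored cookies automatically.
    with opener.open(target_url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")

    if SESSION_ERROR_MARKER in html:
        raise RuntimeError("server returned the session error page")
    return html
```

To use it, create one opener with build_cookie_opener() and call fetch_with_session(opener, 'http://propaccess.traviscad.org/clientdb/?cid=1', 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'). The returned string can be passed to BeautifulSoup in the same way as r2.text above. Reusing the same opener for further property pages should keep sending the same cookie.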
