Question

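I'm trying to pull some data from Ancestry. I come from a .NET background but thought I'd try a bit of Python for this project. I'm stuck on the very first step: I'm trying to open this page and then just print out the rows.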

from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

raw_html = open('https://www.ancestry.co.uk/search/collections/britisharmyservice/?birth=_merthyr+tydfil-wales-united+kingdom_1651442').read()
html = BeautifulSoup(raw_html, 'html.parser')
for p in html.select('tblrow record'):
    print(p)

I'm getting an invalid argument error on open.

Answer 1

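According to the documentation, open is used to:

Open [a] file and return a corresponding file object.

So you can't use it to download the HTML content of a web page. You probably meant to use requests.get, like this: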

raw_html = get('https://www.ancestry.co.uk/search/collections/britisharmyservice/?birth=_merthyr+tydfil-wales-united+kingdom_1651442').text
# .text gets the raw text of the response 
# (http://docs.python-requests.org/en/master/api/#requests.Response.text)
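
To see why the original call fails, here is a minimal sketch that uses only the standard library. The helper name try_open is hypothetical and exists only for this illustration. The built-in open treats its argument as a file system path, so passing a URL raises some subclass of OSError. The exact subclass and message depend on the platform; the "invalid argument" wording in the question is presumably what the asker's system reports for a path containing characters such as `?`.

```python
import os
import tempfile

URL = ('https://www.ancestry.co.uk/search/collections/britisharmyservice/'
       '?birth=_merthyr+tydfil-wales-united+kingdom_1651442')


def try_open(path):
    """Return the file's text, or the OSError raised while opening it."""
    try:
        with open(path, encoding='utf-8') as f:
            return f.read()
    except OSError as exc:
        return exc


# A real local file works as expected.
with tempfile.TemporaryDirectory() as tmp:
    local_path = os.path.join(tmp, 'page.html')
    with open(local_path, 'w', encoding='utf-8') as f:
        f.write('<p>hello</p>')
    local_result = try_open(local_path)

# A URL is not a file path, so open() fails with an OSError subclass.
url_result = try_open(URL)
print(type(url_result).__name__, url_result)
```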

Here are a few suggestions for improving your code:

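requests.get accepts a number of useful parameters. One of them is params, which lets you supply the URL's query parameters as a Python dictionary.
If you need to verify that the request succeeded before accessing its text, check whether response.status_code == requests.codes.ok. That covers only status code 200; if you need to handle more status codes, response.raise_for_status should help, since it raises an exception for error responses.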
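
The two suggestions above can be combined roughly as follows. This is a sketch, not a definitive implementation: fetch_html, BASE_URL, PARAMS, and the http_get argument are names introduced here for illustration, and the 10 second timeout is an arbitrary choice. The http_get argument only exists so the function can be exercised without network access; by default it falls back to requests.get. The urlencode call previews the query string that the dictionary corresponds to, on the assumption that requests encodes params the same way (spaces become `+`).

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.ancestry.co.uk/search/collections/britisharmyservice/'
# Spaces in the value are encoded as '+' when the query string is built.
PARAMS = {'birth': '_merthyr tydfil-wales-united kingdom_1651442'}


def fetch_html(url, params=None, http_get=None, timeout=10):
    """Fetch a page and return its text, raising on 4xx/5xx responses."""
    if http_get is None:
        # Third-party dependency, imported lazily so the sketch loads without it.
        import requests
        http_get = requests.get
    response = http_get(url, params=params, timeout=timeout)
    # Raises requests.exceptions.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text


# Preview the query string that corresponds to PARAMS.
query = urlencode(PARAMS)
print(query)
```

With requests installed, raw_html = fetch_html(BASE_URL, params=PARAMS) would then replace the hand-built URL in the snippet above.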

Invalid argument in open method for web scraping
