Question

I read a web page using urllib.request.urlopen:

import urllib.request
import shutil

my_response = urllib.request.urlopen('https://google.com') # object of HTTPResponse type

Then I want to both save it to a file and keep it in a variable for further processing in the code. For example, if I try the following:

shutil.copyfileobj(my_response, open('gpage.html', 'wb')) # saved successfully
my_content = my_response.read() # empty

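The file is saved successfully, but my_response is empty afterwards. The reverse also holds: if I call .read() first, I get the content, but the saved file is empty: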

my_content = my_response.read() # works as expected
shutil.copyfileobj(my_response, open('gpage.html', 'wb')) # empty file

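In other words, I can only access the content of my_response once. I recall that this behavior is typical of some other kinds of Python objects (all iterators?), but I am not sure what the correct term for it is. In my case, if I want to both write the content to a file and keep it in a variable, what is the recommended solution? (So far I have used the workaround of writing to the file and then reading it back.)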

Answer 1

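This is normal behavior for a buffered reader, which is what the response object is here: data is consumed as it is read, so a second read starts where the first one stopped and finds nothing left. You can easily work around it by reading the content into a variable first and then working with that variable: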

my_content = my_response.read() # read from buffer and store in variable
with open('gpage.html', 'wb') as fp: 
    fp.write(my_content) # use the variable instead of the reader again
# do more stuff with my_content

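The buffer is emptied as you consume its data, which makes room for more data. In this case shutil.copyfileobj also calls .read() on the object, so only whichever call comes first gets the contents of the buffer.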
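To see the one-shot behavior without touching the network, here is a minimal sketch that uses io.BytesIO objects as stand-ins for the HTTPResponse and the output file (an assumption made for illustration; unlike a real response, BytesIO can be rewound with seek):

```python
import io
import shutil

# Stand-ins (assumptions for illustration): an in-memory stream plays the
# role of the HTTPResponse, another one plays the role of the output file.
source = io.BytesIO(b"<html>hello</html>")
sink = io.BytesIO()

# copyfileobj reads from source until it is exhausted...
shutil.copyfileobj(source, sink)

# ...so a later read() starts at the end and returns nothing.
leftover = source.read()

print(sink.getvalue())  # the copied bytes
print(leftover)         # empty bytes object
```

Swapping the order (read() first, copyfileobj second) leaves the sink empty instead, which matches what the question describes.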

Also: the urllib.request documentation suggests opening the URL in a with statement, just like any other resource:

with urllib.request.urlopen('https://google.com') as request:
    my_content = request.read()

This way, once everything has been read from the buffer, the resource is released as soon as the with ...: block ends, so fewer resources stay in use.
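A small sketch of what the with block does, again using io.BytesIO as a stand-in for the response (an assumption for illustration): the stream is closed on exit, while the bytes that were read remain usable.

```python
import io

# io.BytesIO is a stand-in for the response object (illustration only).
with io.BytesIO(b"page body") as stream:
    content = stream.read()  # read everything while the stream is open

# Leaving the with block closed the stream, but the bytes are still ours.
is_closed = stream.closed
print(is_closed, content)
```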

Put together, this gives:

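import urllib.request

# Read the response once and keep the bytes in a variable
with urllib.request.urlopen('https://google.com') as request:
    my_content = request.read()
# Write the same bytes to disk; my_content stays available afterwards
with open('gpage.html', 'wb') as fp:
    fp.write(my_content)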
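If this pattern is needed in several places, it could be wrapped in a small helper. The sketch below is only one possible shape: the name save_and_keep is hypothetical, and the demo feeds it an io.BytesIO stand-in and a temporary path instead of a real response so that it runs offline.

```python
import io
import os
import tempfile


def save_and_keep(response, path):
    """Read a file-like object once, write its bytes to path, return them.

    Hypothetical helper for illustration; it is not part of urllib.
    """
    content = response.read()
    with open(path, "wb") as fp:
        fp.write(content)
    return content


# Demo with a stand-in stream and a temporary file (assumptions).
fake_response = io.BytesIO(b"<html>demo</html>")
demo_path = os.path.join(tempfile.mkdtemp(), "gpage.html")
my_content = save_and_keep(fake_response, demo_path)
```

With a real request, the object returned by urllib.request.urlopen would be passed in place of fake_response, ideally inside a with block so the connection is closed afterwards.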
