Question

Hello, I am reading "Web Scraping with Python" (2015). I've seen the following two ways of opening a URL: with and without .read(). See bs1 and bs2.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html')
bs1 = BeautifulSoup(html.read(), 'html.parser')

html = urlopen('http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html')
bs2 = BeautifulSoup(html, 'html.parser')

bs1 == bs2  # True


print(bs1.prettify()[0:100])
print(bs2.prettify()[0:100]) # prints same thing

Is .read() redundant? Thanks.

Code from p. 7 of Web Scraping with Python (using .read()):

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read())

Code from p. 15 (without .read()):

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html)

Answer 1

urllib.request.urlopen returns a file-like object whose read method returns the response body for that URL.
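
As a minimal sketch of that file-like behaviour, the snippet below opens a data: URL and reads the body. The data: URL is an assumption made here so the example runs without network access (the question uses ordinary http:// URLs), and the markup is made up for illustration.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Hypothetical markup, embedded in a data: URL so no network is needed.
markup = "<html><body><h1>Hello</h1></body></html>"
url = "data:text/html," + quote(markup)

with urlopen(url) as response:
    has_read = hasattr(response, "read")  # file-like: exposes read()
    body = response.read()                # response body, as bytes
    rest = response.read()                # stream already consumed: empty bytes

print(has_read)
print(body)
print(rest)
```

The second read() returns nothing because the stream has already been consumed, which is why the question calls urlopen again before building bs2.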

The BeautifulSoup constructor accepts either a string or an open filehandle, so yes, read() is redundant here.
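
A small offline sketch of this, assuming bs4 is installed. Here io.BytesIO stands in for the object returned by urlopen (an assumption, so that no network is needed), and the markup is made up.

```python
import io

from bs4 import BeautifulSoup

raw = b"<html><body><p>Hello</p></body></html>"  # hypothetical page content

handle = io.BytesIO(raw)  # stand-in for urlopen(...)
from_string = BeautifulSoup(handle.read(), "html.parser")  # pass the content

handle = io.BytesIO(raw)  # fresh handle that has not been read yet
from_handle = BeautifulSoup(handle, "html.parser")  # pass the handle itself

print(from_string == from_handle)
print(from_handle.p.get_text())
```

Both calls should build equal soups, matching the bs1 == bs2 check in the question.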

Answer 2

Quoting the BS docs:

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

When you use the .read() method, you are using the "string" interface. When you don't, you are using the "filehandle" interface.

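In practice both work the same way (although BS4 may read the file-like object lazily). In your case, the entire content is first read into a string object, which may consume more memory than necessary.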
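
One way a single entry point can serve both interfaces is to check whether its argument has a read method. The helper below, load_markup, is hypothetical and is not BeautifulSoup's actual code. It only illustrates the idea, using the standard library.

```python
import io


def load_markup(markup):
    """Return the markup content, given either the content or a file-like object."""
    if hasattr(markup, "read"):  # "filehandle" interface
        return markup.read()
    return markup                # "string" interface


raw = b"<p>Hello</p>"  # made-up content
via_string = load_markup(raw)
via_handle = load_markup(io.BytesIO(raw))

print(via_string == via_handle)
```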

Answer 3

Without the BeautifulSoup module

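When you are not using the BeautifulSoup module, .read() is very useful, so in that case it is not redundant. You only get the HTML content when you call .read(); without it, all you have is the object returned by urlopen().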
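
For instance, the standard library's html.parser.HTMLParser is fed a string, so the body has to be read and decoded first. This is a minimal sketch: the data: URL is an assumption that replaces a real site so the example runs offline, and the HeadingCollector class and the markup are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import quote
from urllib.request import urlopen


class HeadingCollector(HTMLParser):
    """Collect the text found inside <h1> tags."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.headings.append(data)


# Hypothetical page, embedded in a data: URL so no network is needed.
url = "data:text/html," + quote("<html><body><h1>Hello</h1></body></html>")

with urlopen(url) as response:
    html = response.read().decode("utf-8")  # bytes -> str; feed() needs a str

parser = HeadingCollector()
parser.feed(html)
print(parser.headings)
```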

With the BeautifulSoup module

The BeautifulSoup constructor accepts two kinds of input for this: a string, or the object returned by urlopen(some-site).

What does .read() do in urlopen('http .....').read()? [urllib]
