I have saved a webpage at C:\webpage.htm. I want to load it and parse it with BeautifulSoup, but urllib won't open it.
from BeautifulSoup import BeautifulSoup
import urllib2
url="C:\webpage.htm"
page=urllib2.urlopen(url)
This raises an error:
Traceback (most recent call last):
page=urllib2.urlopen(url)
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 400, in open
response = self._open(req, data)
File "C:\Python27\lib\urllib2.py", line 423, in _open
'unknown_open', req)
File "C:\Python27\lib\urllib2.py", line 378, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 1240, in unknown_open
raise URLError('unknown url type: %s' % type)
urllib2.URLError: <urlopen error unknown url type: c>
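How can I fix this, or is there another way to load the document into Beautiful Soup? I also tried saving the page as a text document, but that raised this error: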
'str' object has no attribute 'findall'
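Answer 0 (score: 3)
It looks like you have to specify the protocol. urllib2 treats the text before the first colon as the URL scheme, so "C:\webpage.htm" is read as a URL of type "c", which is exactly what the error reports. For a local file, use a file:// URL with forward slashes instead: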
from BeautifulSoup import BeautifulSoup
import urllib2
# file:// scheme, forward slashes, and the same file name as in the question
url = "file:///C:/webpage.htm"
page = urllib2.urlopen(url)
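The same idea can be sketched for Python 3, which is an assumption here since the thread uses Python 2 and urllib2. The sketch first shows why the bare path fails (its drive letter is parsed as the scheme), then builds the file:// URL with pathlib rather than typing it by hand. The helper name open_local and the sample file content are made up for illustration.

```python
import os
import tempfile
from pathlib import Path, PureWindowsPath
from urllib.parse import urlsplit
from urllib.request import urlopen

# A bare Windows path parses as a URL whose scheme is the drive letter.
print(urlsplit(r"C:\webpage.htm").scheme)  # c

# pathlib can produce the file:// form of an absolute Windows path.
print(PureWindowsPath(r"C:\webpage.htm").as_uri())  # file:///C:/webpage.htm


def open_local(path):
    """Hypothetical helper: open a local file through urlopen via a file:// URL."""
    uri = Path(path).resolve().as_uri()  # as_uri() needs an absolute path
    return urlopen(uri)


# Demo on a throwaway file so the sketch runs on any machine.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "webpage.htm")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("<html><body><p>hello</p></body></html>")
    with open_local(sample) as response:
        html = response.read().decode("utf-8")

print(html)
```

Letting pathlib build the URL avoids mistakes with slash direction and the number of slashes after file:.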
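Answer 1 (score: 3)
Since you are loading the file from your local machine, there is no need to use urllib2. Instead, you can use Python's standard file I/O functions: open(), read(), and close().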
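from BeautifulSoup import BeautifulSoup
# raw string, so the backslash is never read as an escape sequence
url = r"C:\webpage.htm"
f = open(url)
# read the entire file as a string
page = f.read()
# the file can be closed as soon as its contents are in memory
f.close()
soup = BeautifulSoup(page)
# etc...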
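A rough Python 3 version of the same approach might look like the sketch below. It assumes Python 3 and, for the parsing step only, the third-party bs4 package, neither of which appears in the thread. The helper name read_html, the UTF-8 encoding, and the sample file are illustrative assumptions.

```python
import os
import tempfile


def read_html(path, encoding="utf-8"):
    """Hypothetical helper: return the whole file at `path` as one string."""
    # The with block closes the file even if read() raises.
    with open(path, encoding=encoding) as f:
        return f.read()


# Demo on a throwaway file so the sketch runs on any machine.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "webpage.htm")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("<html><body><p>hello</p></body></html>")
    page = read_html(sample)

print(type(page).__name__)  # str

# bs4 is a third-party package (an assumption); skip parsing if it is missing.
try:
    from bs4 import BeautifulSoup
except ImportError:
    BeautifulSoup = None

if BeautifulSoup is not None:
    soup = BeautifulSoup(page, "html.parser")
    # Search methods live on the soup object, not on the raw string.
    print(soup.find_all("p"))
```

Note that read() returns a plain string, which has no search methods. One plausible cause of the "'str' object has no attribute 'findall'" error in the question is calling the search method on that string rather than on the parsed soup object, though the question does not show enough code to be sure.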