I'm writing a small scraper. Here is the code so far.
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re
soup = BeautifulSoup(
    urlopen('http://www.high-rely.com/HR3/includes/ProductFamily.php').read()
)
links = soup.findAll('a', 'visible_link')
hrefs = ['www.high-rely.com' + relative for relative in [x['href'] for x in links]]
subpages = map(BeautifulSoup, [urlopen(x).read() for x in hrefs])
When I run it, I get the following error.
Traceback (most recent call last):
  File "C:/Users/josh.SCL/Desktop/Scraper.py", line 13, in <module>
    subpages = map(BeautifulSoup, [urlopen(x).read() for x in hrefs])
  File "C:\Python27\lib\urllib.py", line 84, in urlopen
    return opener.open(url)
  File "C:\Python27\lib\urllib.py", line 205, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 461, in open_file
    return self.open_local_file(url)
  File "C:\Python27\lib\urllib.py", line 475, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] The system cannot find the path specified: 'www.high-rely.com\\HR3\\includes\\products\\5MinOverview.php'
If I loop through hrefs and print each one, I get this.
www.high-rely.com/HR3/includes/products/5MinOverview.php
www.high-rely.com/HR3/includes/products/10MinOverview.php
www.high-rely.com/HR3/includes/products/30MinOverview.php
www.high-rely.com/HR3/includes/HighRely/HighRely.php
www.high-rely.com/HR3/includes/HighRely/HighRely.php
www.high-rely.com/HR3/includes/RAIDFrame/RAIDFrame.php
www.high-rely.com/HR3/includes/RAIDFrame/RAIDFrame.php
www.high-rely.com/HR3/includes/MPac/MPac.php
www.high-rely.com/HR3/includes/MPac/MPac.php
www.high-rely.com/HR3/includes/BNAS/BNAS-HRS201.php
www.high-rely.com/HR3/includes/announcements.php
Those look correct to me. What is going on here?
Answer 0 (score: 3)
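You forgot the http:// prefix. Without a scheme, urllib treats the string as a local file path, which is why the traceback ends in open_local_file and the error shows the URL with backslashes. Add the scheme when building the list: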
hrefs = ['http://www.high-rely.com' + relative for relative in [x['href'] for x in links]]
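As an alternative to concatenating strings by hand, the standard library's urljoin can resolve each relative href against the URL of the page it came from, which keeps the scheme and host automatically. The following is only a minimal sketch: the helper name absolutize and the sample list are made up for illustration, and the sample hrefs assume the links on the page are root-relative (starting with /HR3/...), as the printed output in the question suggests.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as used in the question

# URL of the page the links were scraped from (taken from the question).
BASE_URL = 'http://www.high-rely.com/HR3/includes/ProductFamily.php'


def absolutize(base_url, relative_links):
    """Hypothetical helper: resolve each relative href against base_url."""
    return [urljoin(base_url, link) for link in relative_links]


# Sample hrefs, assumed to be root-relative like those in the question.
sample = [
    '/HR3/includes/products/5MinOverview.php',
    '/HR3/includes/announcements.php',
]

for url in absolutize(BASE_URL, sample):
    print(url)
```

In the original script this would replace the hrefs line with something like hrefs = absolutize(BASE_URL, [x['href'] for x in links]). It also handles links that are already absolute, since urljoin returns those unchanged.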