Question

I'm writing a small scraper. Here is the code so far.

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

soup = BeautifulSoup(
    urlopen('http://www.high-rely.com/HR3/includes/ProductFamily.php').read()
    )

links = soup.findAll('a', 'visible_link')

hrefs = ['www.high-rely.com' + relative for relative in [x['href'] for x in links]]

subpages = map(BeautifulSoup, [urlopen(x).read() for x in hrefs])

When I run it, I get the following error.

Traceback (most recent call last):
  File "C:/Users/josh.SCL/Desktop/Scraper.py", line 13, in <module>
    subpages = map(BeautifulSoup, [urlopen(x).read() for x in hrefs])
  File "C:\Python27\lib\urllib.py", line 84, in urlopen
    return opener.open(url)
  File "C:\Python27\lib\urllib.py", line 205, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 461, in open_file
    return self.open_local_file(url)
  File "C:\Python27\lib\urllib.py", line 475, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] The system cannot find the path specified: 'www.high-rely.com\\HR3\\includes\\products\\5MinOverview.php'

If I loop over hrefs and print each entry, I get this.

www.high-rely.com/HR3/includes/products/5MinOverview.php
www.high-rely.com/HR3/includes/products/10MinOverview.php
www.high-rely.com/HR3/includes/products/30MinOverview.php
www.high-rely.com/HR3/includes/HighRely/HighRely.php
www.high-rely.com/HR3/includes/HighRely/HighRely.php
www.high-rely.com/HR3/includes/RAIDFrame/RAIDFrame.php
www.high-rely.com/HR3/includes/RAIDFrame/RAIDFrame.php
www.high-rely.com/HR3/includes/MPac/MPac.php
www.high-rely.com/HR3/includes/MPac/MPac.php
www.high-rely.com/HR3/includes/BNAS/BNAS-HRS201.php
www.high-rely.com/HR3/includes/announcements.php

Which looks correct. What is going on here?

Answer 1

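You forgot the http:// scheme. Without it, urllib treats the string as a local file path, which is why the traceback ends in open_local_file: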

hrefs = ['http://www.high-rely.com' + relative for relative in [x['href'] for x in links]]
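As a rough sketch of a more general alternative, the relative links can be resolved against the page URL with urljoin, so the scheme and host are carried over automatically instead of being concatenated by hand. The snippet assumes Python 3 (where urljoin lives in urllib.parse; in Python 2 it is in the urlparse module), and the absolutize helper and the sample hrefs are hypothetical stand-ins for the values extracted from the page.

```python
from urllib.parse import urljoin, urlparse  # Python 2: from urlparse import urljoin, urlparse

# Page the links were scraped from (taken from the question).
BASE_URL = 'http://www.high-rely.com/HR3/includes/ProductFamily.php'


def absolutize(base, relative_hrefs):
    """Resolve each href against base so every result carries a scheme and host."""
    return [urljoin(base, href) for href in relative_hrefs]


# Hypothetical sample of the relative hrefs found on the page.
sample_hrefs = [
    '/HR3/includes/products/5MinOverview.php',
    '/HR3/includes/announcements.php',
]

urls = absolutize(BASE_URL, sample_hrefs)

# A string without a scheme parses with an empty scheme, which is what
# sends the original code down the local-file path.
broken = 'www.high-rely.com' + sample_hrefs[0]
print(repr(urlparse(broken).scheme))

for url in urls:
    print(url)
```

Resolving against the page URL also copes with hrefs that are already absolute or that are relative to the current directory, which plain string concatenation would not.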
