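How to resolve 'raise ValueError("unknown url type: %r" % self.full_url) ValueError: unknown url type: '''

Time: 2019-08-09 17:15:51

Tags: python beautifulsoup urllib

I have been following some videos to learn Python, but I cannot get past this error. I have experience in other languages, so I can usually fix errors myself, but whatever I try here I get either the same error or a different one.

I tried switching the parser argument from 'xml' to 'lxml', but that only changes the error I get.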

from bs4 import BeautifulSoup
import urllib.request



req = urllib.request.urlopen('http://pythonprogramming.net/')


xml = BeautifulSoup(req, 'xml')

for item in xml.findAll('link'):
    url = item.text
    news = urllib.request.urlopen(url).read()
    print(news)

Ideally, this would print some text from the link tags, but instead I get the following errors.

Error when using xml:

  File "/Users/rodrigo/Desktop/ALL/Programming/Python/Python Web Programming/Working with HTML/scrapingParagraphData.py", line 13, in <module>
    news = urllib.request.urlopen(url).read()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 548, in _open
    'unknown_open', req)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1387, in unknown_open
    raise URLError('unknown url type: %s' % type)
urllib.error.URLError: <urlopen error unknown url type: @media (min-width>

Error when using lxml:

  File "/Users/rodrigo/Desktop/ALL/Programming/Python/Python Web Programming/Working with HTML/scrapingParagraphData.py", line 13, in <module>
    news = urllib.request.urlopen(url).read()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 510, in open
    req = Request(fullurl, data)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 328, in __init__
    self.full_url = url
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 354, in full_url
    self._parse()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 383, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: ''

1 Answer:

Answer 0: (score: 0)

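Your current code targets the link elements and extracts their text rather than their href, so urlopen receives a string with no recognizable protocol (scheme).

Even if you did extract the href, most of the values are relative, so you would still hit the unknown-protocol problem.

item['href'] would give: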

/static/favicon.ico
/static/css/materialize.min.css
https://fonts.googleapis.com/icon?family=Material+Icons
/static/css/bootstrap.css

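I don't think those are the kind of links you are after. If you want the tutorial links, you need a selector that targets those elements, for example: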

tutorial_links = ['https://pythonprogramming.net' + i['href'] for i in xml.select('.waves-light.btn')]
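Hard-coding the prefix works for relative paths, but it would mangle an href that is already absolute, such as the Google Fonts one listed above. A minimal sketch of a more general approach, assuming the standard library's urllib.parse.urljoin and using two of the hrefs above as sample data (the names BASE_URL and absolutize are illustrative, not part of the original code):

```python
from urllib.parse import urljoin

# Assumed base URL, taken from the site used in the question.
BASE_URL = 'https://pythonprogramming.net/'


def absolutize(hrefs, base=BASE_URL):
    """Resolve each href against the base; absolute hrefs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]


sample_hrefs = [
    '/static/favicon.ico',
    'https://fonts.googleapis.com/icon?family=Material+Icons',
]
resolved = absolutize(sample_hrefs)
print(resolved)
```

The same call could replace the string concatenation in the list comprehension, so that relative and absolute hrefs are both handled.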

I would also rename the variable that BeautifulSoup(req, 'lxml') is assigned to, for example to soup:

from bs4 import BeautifulSoup
import urllib.request

req = urllib.request.urlopen('http://pythonprogramming.net/')
soup = BeautifulSoup(req, 'lxml')
tutorial_links = ['https://pythonprogramming.net' + i['href'] for i in soup.select('.waves-light.btn')]
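
To see why the original loop fails, here is a small offline sketch. The HTML snippet is made up for illustration (it is not the real page), and it uses Python's built-in 'html.parser' so that lxml is not required. It shows that a link tag has no text, and that checking the scheme before calling urlopen filters out anything urlopen could not fetch over HTTP:

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the page's <head>.
html = """
<html><head>
<link rel="icon" href="/static/favicon.ico">
<link rel="stylesheet" href="https://fonts.googleapis.com/icon?family=Material+Icons">
<link rel="stylesheet">
</head><body></body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('link')

# <link> is a void element in HTML, so its text is empty.
# An empty string is what the original loop handed to urlopen.
texts = [link.text for link in links]

fetchable = []
for link in links:
    href = link.get('href')  # None when the attribute is missing
    if not href:
        continue
    absolute = urljoin('https://pythonprogramming.net/', href)
    # Keep only URLs with a scheme that can be fetched over HTTP(S).
    if urlparse(absolute).scheme in ('http', 'https'):
        fetchable.append(absolute)

print(texts)
print(fetchable)
```

Each entry in fetchable could then be passed to urllib.request.urlopen without triggering the ValueError from the question.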