因此,我一直在跟踪一些视频以学习python,但无法摆脱此错误。我有其他语言的经验,因此通常可以纠正错误,但是无论我做什么,我都会遇到相同的错误或不同的错误。
我尝试将参数从'xml'切换为'lxml',但这只会改变我得到的错误
from bs4 import BeautifulSoup
import urllib.request
req = urllib.request.urlopen('http://pythonprogramming.net/')
xml = BeautifulSoup(req, 'xml')
for item in xml.findAll('link'):
url = item.text
news = urllib.request.urlopen(url).read()
print(news)
理想情况下,这会打印出链接标记中的一些文本,但是,出现以下错误-
使用xml时出错-
File "/Users/rodrigo/Desktop/ALL/Programming/Python/Python Web Programming/Working with HTML/scrapingParagraphData.py", line 13, in <module>
news = urllib.request.urlopen(url).read()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 548, in _open
'unknown_open', req)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 503, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 1387, in unknown_open
raise URLError('unknown url type: %s' % type)
urllib.error.URLError: <urlopen error unknown url type: @media (min-width>
使用lxml时出错-
File "/Users/rodrigo/Desktop/ALL/Programming/Python/Python Web Programming/Working with HTML/scrapingParagraphData.py", line 13, in <module>
news = urllib.request.urlopen(url).read()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 510, in open
req = Request(fullurl, data)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 328, in __init__
self.full_url = url
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 354, in full_url
self._parse()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 383, in _parse
raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: ''
答案 0 :(得分:0)
您当前的代码针对链接元素,并且提取文本而不是href,因此没有已知的协议可以使用。
即使您提取了href,它们也是相对的,所以您对未知协议仍然会遇到问题。
item['href']
会给出:
/static/favicon.ico
/static/css/materialize.min.css
https://fonts.googleapis.com/icon?family=Material+Icons
/static/css/bootstrap.css
我认为您不喜欢这些类型的链接。如果您在教程链接之后,那么您需要针对这些元素的东西,例如
tutorial_links = ['https://pythonprogramming.net' + i['href'] for i in xml.select('.waves-light.btn')]
我可能会将BeautifulSoup(req, 'lxml')
中的赋值变量重命名为:
from bs4 import BeautifulSoup
import urllib.request
req = urllib.request.urlopen('http://pythonprogramming.net/')
soup = BeautifulSoup(req, 'lxml')
tutorial_links = ['https://pythonprogramming.net' + i['href'] for i in xml.select('.waves-light.btn')]