I'm learning Python through Automate the Boring Stuff. The program is supposed to go to http://xkcd.com/ and download all of the images for offline viewing.
I'm on version 2.7 and a Mac.
For some reason I'm getting errors like "No schema supplied" and errors from using requests.get() itself.
Here is my code:
# Saves the XKCD comic page for offline read

import requests, os, bs4, shutil

url = 'http://xkcd.com/'

if os.path.isdir('xkcd') == True: # If xkcd folder already exists
    shutil.rmtree('xkcd') # delete it
else: # otherwise
    os.makedirs('xkcd') # Creates xkcd folder.

while not url.endswith('#'): # If there are no more posts, the url will end with '#'; exit the while loop
    # Download the page
    print 'Downloading %s page...' % url
    res = requests.get(url) # Get the page
    res.raise_for_status() # Check for errors

    soup = bs4.BeautifulSoup(res.text) # Parse the page
    # Find the URL of the comic image
    comicElem = soup.select('#comic img') # Any #comic img it finds will be saved as a list in comicElem
    if comicElem == []: # if the list is empty
        print 'Couldn\'t find the image!'
    else:
        comicUrl = comicElem[0].get('src') # Get the first index in comicElem (the image) and save to comicUrl
        # Download the image
        print 'Downloading the %s image...' % (comicUrl)
        res = requests.get(comicUrl) # Get the image. Getting something will always use requests.get()
        res.raise_for_status() # Check for errors

        # Save image to ./xkcd
        imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(10000):
            imageFile.write(chunk)
        imageFile.close()

    # Get the Prev btn's URL
    # The Previous button is the first <a rel="prev" href="/1535/" accesskey="p">&lt; Prev</a>
    prevLink = soup.select('a[rel="prev"]')[0]
    url = 'http://xkcd.com/' + prevLink.get('href') # adds /1535/ to http://xkcd.com/

print 'Done!'
Here is the error:
Traceback (most recent call last):
  File "/Users/XKCD.py", line 30, in <module>
    res = requests.get(comicUrl) # Get the image. Getting something will always use requests.get()
  File "/Library/Python/2.7/site-packages/requests/api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 451, in request
    prep = self.prepare_request(req)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 382, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/Library/Python/2.7/site-packages/requests/models.py", line 304, in prepare
    self.prepare_url(url, params)
  File "/Library/Python/2.7/site-packages/requests/models.py", line 362, in prepare_url
    to_native_string(url, 'utf8')))
requests.exceptions.MissingSchema: Invalid URL '//imgs.xkcd.com/comics/the_martian.png': No schema supplied. Perhaps you meant http:////imgs.xkcd.com/comics/the_martian.png?
The thing is, I've read the section of the book about this program several times, read the requests documentation, and looked at other questions on here. My syntax looks right.
Thanks for any help!
EDIT:
This didn't work:
comicUrl = ("http:"+comicElem[0].get('src'))
I thought adding http: in front would get rid of the "no schema supplied" error.
Answer 0 (score: 10)
No schema means you haven't supplied http:// or https://. Supply these and it will do the trick.
EDIT: Look at this URL string!:

URL '//imgs.xkcd.com/comics/the_martian.png':
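To expand on this with a sketch (not part of the original answer): the src attribute xkcd serves is a protocol-relative URL, and the standard library can resolve it against the page URL instead of string-pasting a scheme. The snippet uses the Python 3 name urllib.parse.urljoin; on the asker's Python 2.7 the equivalent import is from urlparse import urljoin.

```python
# Resolve a protocol-relative src (e.g. //imgs.xkcd.com/...) against the page URL.
# Python 3 shown; on Python 2.7 use: from urlparse import urljoin
from urllib.parse import urljoin

page_url = 'http://xkcd.com/'
src = '//imgs.xkcd.com/comics/the_martian.png'  # value from comicElem[0].get('src')

comicUrl = urljoin(page_url, src)  # inherits the page's scheme
print(comicUrl)  # http://imgs.xkcd.com/comics/the_martian.png
```

This also keeps working if xkcd ever switches to absolute or https URLs, since urljoin leaves a URL that already has a scheme untouched.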
Answer 1 (score: 7)
Change your comicUrl to this:

comicUrl = comicElem[0].get('src').strip("http://")
comicUrl = "http://" + comicUrl
if 'xkcd' not in comicUrl:
    comicUrl = comicUrl[:7] + 'xkcd.com/' + comicUrl[7:]
print "comic url", comicUrl
Answer 2 (score: 1)
Explanation:
A few XKCD pages have special content that isn't a simple image file. That's okay; you can just skip those. If your selector doesn't find any elements, soup.select('#comic img') will return an empty list.
Working code:
import requests, os, bs4, shutil

url = 'http://xkcd.com'

# making new folder
if os.path.isdir('xkcd') == True:
    shutil.rmtree('xkcd')
else:
    os.makedirs('xkcd')

# scraping information
while not url.endswith('#'):
    print('Downloading Page %s.....' % (url))
    res = requests.get(url)  # getting page
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text)

    comicElem = soup.select('#comic img')  # getting img tag under comic division
    if comicElem == []:  # if not found print error
        print('could not find comic image')
    else:
        try:
            comicUrl = 'http:' + comicElem[0].get('src')  # getting comic url and then downloading its image
            print('Downloading image %s.....' % (comicUrl))
            res = requests.get(comicUrl)
            res.raise_for_status()
        except requests.exceptions.MissingSchema:
            # skip if not a normal image file
            prev = soup.select('a[rel="prev"]')[0]
            url = 'http://xkcd.com' + prev.get('href')
            continue

        imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')  # write downloaded image to hard disk
        for chunk in res.iter_content(10000):
            imageFile.write(chunk)
        imageFile.close()

    # get previous link and update url
    prev = soup.select('a[rel="prev"]')[0]
    url = "http://xkcd.com" + prev.get('href')

print('Done...')
Answer 3 (score: 0)
I'd just like to mention here that I had this exact same error and used @Ajay's recommended answer above, but even after adding that I was still having problems; right after the program downloaded the first image it would stop and return this error:

ValueError: Unsupported or invalid CSS selector: "a[rel"

This was referring to one of the last lines in the program where the 'Prev button' is used to go to the next image to download.

Anyway, after going through the bs4 docs I made a slight change as follows and it seems to work fine now:

prevLink = soup.select('a[rel^="prev"]')[0]

Someone else might run into the same problem, so I thought I'd add this comment.
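For anyone who wants to check this quickly without hitting the network, here is a minimal sketch (assuming a reasonably recent bs4 install) showing the prefix-match selector finding a Prev-style link:

```python
import bs4

# Minimal fragment shaped like xkcd's Prev button.
html = '<a rel="prev" href="/1535/" accesskey="p">&lt; Prev</a>'
soup = bs4.BeautifulSoup(html, 'html.parser')

# ^= matches attribute values that start with "prev"; this form is accepted by
# bs4 versions that reject the plain a[rel="prev"] selector with
# "Unsupported or invalid CSS selector".
prevLink = soup.select('a[rel^="prev"]')[0]
print(prevLink.get('href'))  # /1535/
```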
Answer 4 (score: 0)
Actually this isn't a big problem. You can see that the comicUrl looks like //imgs.xkcd.com/comics/acceptable_risk.png. The only thing you need to add is http:, and remember that it's http: and not http:// as some people said earlier, because the url already contains the double slash.
So please change the code to

res = requests.get('http:' + comicElem[0].get('src'))

or

comicUrl = 'http:' + comicElem[0].get('src')
res = requests.get(comicUrl)

Happy coding!
Answer 5 (score: 0)
I had a similar issue. Somehow a response code 400 was being used as the url to parse, so obviously that url was invalid. Here is my code and the error:
import cloudscraper  # to bypass cloudflare that is blocking requests with the request module
import time
import random
import json
import socket
from collections import OrderedDict
from requests import Session

with open("conf.json") as conf:
    config = json.load(conf)
    addon_api = config.get("Addon API")
    addonapi_url = config.get("Addon URL")
    addonapi_ip = config.get("Addon IP")
    addonapi_agent = config.get("Addon User-agent")

# getip = socket.getaddrinfo("https://my.url.com", 443)
# (family, type, proto, canonname, (address, port)) = getip[0]

session = Session()
headers = OrderedDict({
    'Accept-Encoding': 'gzip, deflate, br',
    'Host': addonapi_ip,
    'User-Agent': addonapi_agent
})
session.headers = headers

# define the Data we will post to the Website
data = {
    "apikey": addon_api,
    "action": "get_user_info",
    "value": "username"
}

try:  # try-block to handle exceptions if the request Failed
    randomsleep1 = random.randint(10, 30)
    randomsleep2 = random.randint(10, 30)
    randomsleep_total = randomsleep1 + randomsleep2

    data_variable = data
    headers_variable = headers
    payload = {"key1": addonapi_ip, "key2": data_variable, "key3": headers_variable}

    getrequest = session.get(url=addonapi_ip, data=data_variable, headers=headers_variable, params=payload)
    postrequest = session.get(url=addonapi_ip, data=data_variable, headers=headers_variable, params=payload)  # sending Data to the Website
    print(addonapi_ip)

    scraper = cloudscraper.create_scraper()  # returns a CloudScraper instance

    print(f"Sleeping for {randomsleep1} Seconds before posting Data to API!")
    time.sleep(randomsleep1)
    session.get(postrequest)  # sending Data to the Website

    print(f"Sleeping for {randomsleep2} Seconds before getting Data from API!")
    time.sleep(randomsleep2)
    print(f"Total Seconds i slept during the Request: {randomsleep_total}")
    session.post(postrequest)

    print(f"Data sent: {postrequest}")
    print(f"Data recived: {getrequest}")  # printing the output from the Request into our Terminal

    # post = requests.post(addonapi_url, data=data, headers=headers)
    # print(post.status_code)
    # print(post.text)
except Exception as e:
    raise e
    # print(e)  # print a error if occurced

# =========================================== #
Sleeping for 15 Seconds before posting Data to API!
Traceback (most recent call last):
  File "C:\Users\You.Dont.See.My.Name\PythonProjects\addon_bot\addon.py", line 69, in <module>
    raise e
  File "C:\Users\You.Dont.See.My.Name\PythonProjects\addon_bot\addon.py", line 55, in <module>
    session.get(postrequest) # sending Data to the Website
  File "P:\Documents\IT\Python\lib\site-packages\requests\sessions.py", line 546, in get
    return self.request('GET', url, **kwargs)
  File "P:\Documents\IT\Python\lib\site-packages\requests\sessions.py", line 519, in request
    prep = self.prepare_request(req)
  File "P:\Documents\IT\Python\lib\site-packages\requests\sessions.py", line 452, in prepare_request
    p.prepare(
  File "P:\Documents\IT\Python\lib\site-packages\requests\models.py", line 313, in prepare
    self.prepare_url(url, params)
  File "P:\Documents\IT\Python\lib\site-packages\requests\models.py", line 387, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL '<Response [400]>': No schema supplied. Perhaps you meant http://<Response [400]>?
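My reading of what happens here, sketched with a stand-in class instead of a live request: session.get(postrequest) hands a Response object where a URL string belongs, so requests stringifies it into '<Response [400]>', which has no scheme, hence MissingSchema. The fix would be to pass the URL itself (the original string, or postrequest.url), never the Response object. The FakeResponse class below is a hypothetical stand-in, not part of requests:

```python
from urllib.parse import urlparse

class FakeResponse:
    """Stand-in for requests.Response; its repr mimics '<Response [400]>'."""
    def __init__(self, status_code):
        self.status_code = status_code
    def __repr__(self):
        return '<Response [%d]>' % self.status_code

resp = FakeResponse(400)

# Passing the response object where a URL string is expected stringifies it:
bad_url = str(resp)
print(bad_url)  # <Response [400]>

# No scheme can be parsed out of that string, which is why requests
# raises MissingSchema for it.
print(repr(urlparse(bad_url).scheme))  # ''
```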