Downloading photos from URLs in a CSV file – urllib.error.HTTPError: HTTP Error 403: Forbidden

Time: 2020-10-12 16:46:46

Tags: python python-3.x beautifulsoup urllib http-status-code-403

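My script below is supposed to download a bunch of images from a list of URLs, but it keeps failing at raise HTTPError(req.full_url, code, msg, hdrs, fp) with urllib.error.HTTPError: HTTP Error 403: Forbidden.

I'm not sure what to do. You can run it yourself; I've provided everything below.

Any help would be greatly appreciated (:

The goal is to download a bunch of images from a list of URLs in a CSV file without getting error 403.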

from bs4 import BeautifulSoup
from time import sleep
import urllib.request
import requests
import praw
import csv

r = praw.Reddit(client_id=client_id,
                client_secret=client_secret, 
                user_agent=user_agent,
                username=username,
                password=password)

subred = r.subreddit("partyparrot")
top = subred.top(limit = 780)
type(top)
x = next(top)
dir(x)

with open("output_reddit.csv", 'r') as csvfile:

    headers = {
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
                'Accept-Encoding': 'none',
                'Accept-Language': 'en-US,en;q=0.8',
                'Connection': 'keep-alive',
                'Access-Control-Allow-Origin': '*',
                'Access-Control-Allow-Methods': 'GET',
                'Access-Control-Allow-Headers': 'Content-Type',
                'Access-Control-Max-Age': '3600',
                'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
                }

    for line in csvfile:
        splitted_line = line.split('||')
        if splitted_line[2] != '' and splitted_line[2] != "\n" and ".png" in splitted_line[2]:
            urllib.request.urlretrieve(splitted_line[2], filename=("img_" + splitted_line[0] + ".png")) 
            print ("Image saved for {0}".format(splitted_line[0]))

        elif splitted_line[2] != '' and splitted_line[2] != "\n" and ".jpg" in splitted_line[2]:
            urllib.request.urlretrieve(splitted_line[2], filename=("img_" + splitted_line[0] + ".jpg")) 
            print ("Image saved for {0}".format(splitted_line[0]))

        elif splitted_line[2] != '' and splitted_line[2] != "\n" and "v.redd.it" in splitted_line[2]:

            urllib.request.urlretrieve(splitted_line[2].rstrip() + "/DASH_720.mp4", filename=("img_" + splitted_line[0] + ".mp4")) 
            print ("Image saved for {0}".format(splitted_line[0]))

        else:
            print ("No result for {0}".format(splitted_line[0]))


Below is the output_reddit.csv file for reference.

2||I tried the no pet challenge... she wasn't having it||https://v.redd.it/da60x1qizgs51
3||My trip to the salon went horribly wrong.||https://v.redd.it/tfzc1vye6ds51
4||A few sketches of my macaw buddy from work. Haven't seen this silly girl in six months due to quarantine, I miss her.||https://i.redd.it/jjkb3b5ntis51.jpg
5||Thermals of the party girl!||https://i.imgur.com/rfGChUQ.jpg
6||I present you with Lorena. After rescue, I found out shes an old bird and mostly blind. Once allowed out of her cage to roam free and was given plenty of wonderful fruits and veggies, she became very warm and cuddly. Shes a very sweet regal lady and definitely a queen.||https://v.redd.it/saojuaycnds51
7||A day in the life of the OG Party Parrot. Credit: Ranger Sarah Little.||https://i.redd.it/wjwvl3u01js51.jpg
8||Party game||https://v.redd.it/8myoampepgs51
9||Here I present to you the Christmas loving partyparrot named Felix. He loved sitting in the tree but never chewed on it. Now rest in peace, little friend we will allways love and remember you.||https://i.redd.it/wengcned7is51.jpg

Below is the full log as well, for reference.

Matts-MacBook-Pro-5:Download matt$ python run.py
Image saved for 2
Image saved for 3
Image saved for 4
Image saved for 5
Traceback (most recent call last):
  File "run.py", line 107, in <module>
    urllib.request.urlretrieve(splitted_line[2].rstrip() + "/DASH_720.mp4", filename=("img_" + splitted_line[0] + ".mp4")) 
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

1 answer:

Answer 0 (score: -1)

I realized I was getting error 403 because the URL was unavailable for some reason.

I solved my problem by changing the script.

This allowed me to skip the URLs that were broken in some way.
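
A minimal sketch of what skipping broken URLs might look like, assuming the change was to wrap each urlretrieve call in a try/except. The helper name download_or_skip and its retrieve parameter are hypothetical and not part of the original script; retrieve only exists so the download function can be swapped out.

```python
import urllib.error
import urllib.request


def download_or_skip(url, filename, retrieve=urllib.request.urlretrieve):
    """Try to save url to filename; return True on success, False if skipped.

    Hypothetical helper: `retrieve` defaults to urllib.request.urlretrieve.
    """
    try:
        retrieve(url, filename=filename)
    except urllib.error.HTTPError as err:
        # The server answered with an error status (e.g. 403 or 404) for this URL.
        print("Skipped {0}: HTTP {1} {2}".format(url, err.code, err.reason))
        return False
    except urllib.error.URLError as err:
        # Network-level failure such as a DNS error or a refused connection.
        print("Skipped {0}: {1}".format(url, err.reason))
        return False
    return True
```

In the loop from the question, each urllib.request.urlretrieve(...) call could then become something like download_or_skip(splitted_line[2], "img_" + splitted_line[0] + ".png"), printing "Image saved" only when it returns True. HTTPError is caught first because it is a subclass of URLError.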