Downloading and renaming attachments from a CSV behind basic authentication, with wget or Python

Date: 2017-01-21 15:24:24

Tags: python web-scraping wget

I scraped a ticketing site we are using, and I now have a CSV file laid out as: ID, Attachment_URL, Ticket_URL. What I need to do now is download every attachment and rename the file using its Ticket_URL. The main problem is that when you navigate to the Attachment_URL you have to use basic authentication, and you are then redirected to an AWS S3 link. I have been able to download single files with wget, but I can't iterate through the whole list (about 35k rows), and I don't know how to name each file with its ticket id. Any advice would be appreciated.
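Since the site uses HTTP basic auth, requests can send the credentials directly and will follow the redirect to S3 on its own. A minimal sketch of the whole loop, with placeholder credentials, a hypothetical `download_attachment` helper, and the column names taken from the CSV described above:

```python
import csv

import requests
from requests.auth import HTTPBasicAuth

session = requests.Session()
session.auth = HTTPBasicAuth("user", "pw")  # placeholder credentials

def download_attachment(row, dest=".", session=session):
    """Fetch one attachment; requests follows the S3 redirect automatically."""
    resp = session.get(row["Attachment_URL"])
    resp.raise_for_status()
    # name the file after the last path segment of the Ticket_URL
    name = row["Ticket_URL"].rstrip("/").rsplit("/", 1)[-1]
    path = f"{dest}/{name}"
    with open(path, "wb") as f:
        f.write(resp.content)
    return path

# usage sketch:
# with open("tickets.csv", newline="") as fh:
#     for row in csv.DictReader(fh):
#         download_attachment(row)
```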

1 answer:

Answer 0 (score: 0):

Got it.

Open an authenticated session:

# -*- coding: utf-8 -*-
import csv
import re

import requests
from bs4 import BeautifulSoup

s = requests.session()

payload = {
    'user': '',
    'pw': ''
}

s.post('login.url.here', data=payload)

with open('fd.csv', 'a') as f:
    writer = csv.writer(f)
    for i in range(1, 6000):
        page = s.get(
            'https://urlhere.com/efw/stuff&page={}'.format(i))

        soup = BeautifulSoup(page.content, 'html.parser')  # name a parser explicitly
        table = soup.find("table", {"class": "table-striped"})
        rows = table.find('tbody').find_all('tr')[1:]
        print("The current page is: " + str(i))

        for row in rows:
            links = row.find_all('a', attrs={'href': re.compile("^/helpdesk/")})
            # write the hrefs themselves, not the Tag objects
            writer.writerow([a['href'] for a in links])

Then I cleaned up some of the links in R and downloaded the files:

#!/usr/bin/env python
import os
import threading
from queue import Queue  # the 'Queue' module was renamed in Python 3
from time import gmtime, strftime

import requests

s = requests.session()

payload = {
    'user': '',
    'pw': ''
}
s.post('login', data=payload)


class Log:

    def info(self, message):
        self.__message("info", message)

    def error(self, message):
        self.__message("error", message)

    def debug(self, message):
        self.__message("debug", message)

    def __message(self, log_level, message):
        date = strftime("%Y-%m-%d %H:%M:%S", gmtime())
        print("%s [%s] %s" % (date, log_level, message))


class Fetch:
    def __init__(self):
        self.temp_dir = "/tmp"

    def run_fetcher(self, queue):
        while not queue.empty():
            url, ticketid = queue.get()

            # rows without a ticket keep the attachment's own name
            if ticketid.endswith("NA"):
                file_name = url.split("/")[-1] + 'NoTicket'
            else:
                file_name = ticketid.split("/")[-1]

            response = s.get(url)

            with open(os.path.join('/Users/Desktop/FolderHere', file_name + '.mp3'), 'wb') as f:
                f.write(response.content)
                print(file_name)

            queue.task_done()


if __name__ == '__main__':

    q = Queue()
    log = Log()
    fe = Fetch()

    # read the input file: ID,Attachment_URL,Ticket_URL
    with open('/Users/name/csvfilehere.csv', 'r') as csvfile:
        for line in csvfile:
            row_id, url, ticket = line.split(",")
            q.put([url.strip(), ticket.strip()])

    # spin up fetcher workers
    threads = []
    for i in range(8):
        t = threading.Thread(target=fe.run_fetcher, args=(q,))
        t.daemon = True
        threads.append(t)
        t.start()

    # wait for the workers to drain the queue
    for t in threads:
        t.join()
    q.join()
    log.info("End")
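The naming rule inside `run_fetcher` can be pulled out into a small pure function, which makes it easy to check against a few sample rows before pointing the script at all 35k of them. A sketch; the function name is mine:

```python
def filename_for(attachment_url, ticket_url, ext=".mp3"):
    """Derive the local file name from one CSV row.

    Rows whose Ticket_URL ends in "NA" fall back to the attachment's own
    last path segment plus a 'NoTicket' marker, mirroring the script above.
    """
    if ticket_url.endswith("NA"):
        return attachment_url.split("/")[-1] + "NoTicket" + ext
    return ticket_url.split("/")[-1] + ext

# e.g. filename_for("https://x.com/files/abc", "https://site/helpdesk/tickets/123")
# -> "123.mp3"
```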