How do I download Word documents from URLs into a folder in a specified directory in Python?

Time: 2016-06-27 12:37:14

Tags: python web-scraping python-requests

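I am trying to download multiple Word documents from a website into a folder that I can iterate over. They are hosted in a SharePoint list, and I have been able to parse the HTML to compile a list of links to all of these Word documents. Clicking one of these links prompts you to open or save the Word document. The title of each Word document also appears at the end of its link, so I have been able to split the URL strings to get a list of document names that lines up with my list of URLs. My goal is to write a loop that goes through all the URLs and downloads every Word document into a folder.

Edit: taking into account the suggestions from @DeepSpace and @aneroid (and doing my best to implement them), my code is: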

import requests
from requests_ntlm import HttpNtlmAuth
import shutil

def download_word_docs(doc_url, doc_name):
    # 'domain\\user' and 'password' are placeholders for real NTLM credentials
    r = requests.get(doc_url, auth=HttpNtlmAuth('domain\\user', 'password'), stream=True)
    with open(doc_name, 'wb') as f:
        shutil.copyfileobj(r.raw, f)  # where's it copying the fileobj to?

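I think this is different from downloading an image, because my request goes to a download link rather than to an actual JPEG file... I may be wrong, but it is a tricky situation.

I am still trying to get my program to download the .docx (or create a copy of it) into a folder whose path I can set. Right now it runs without errors in an administrator command prompt (I am on Windows), but I don't know where it is copying the file to. My hope is that once I get one download working, I can figure out how to loop it over my list of URLs. Thank you both (@DeepSpace and @aneroid) for the help so far.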

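2 Answers:

Answer 0: (score: 0)

In your code, you mentioned

"Is there any way to avoid opening/writing a new file and download it directly?"

There is no "direct download". That is what browsers do as well, using code similar to what you are trying to write: they create a new file with the name specified by the server or by the URL.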
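
As for where the file ends up: `open(doc_name, 'wb')` with a bare file name writes into the current working directory of the process. A minimal sketch of saving into a chosen folder instead, assuming two hypothetical helpers named `resolve_target` and `write_chunks` (neither is part of requests):

```python
import os


def resolve_target(folder, doc_name):
    """Return the absolute path of doc_name inside folder."""
    return os.path.join(os.path.abspath(folder), doc_name)


def write_chunks(chunks, folder, doc_name):
    """Write an iterable of byte chunks to folder/doc_name; return the path."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    target = resolve_target(folder, doc_name)
    with open(target, 'wb') as f:
        for chunk in chunks:
            f.write(chunk)
    return target


# A bare file name is resolved against the current working directory:
print(os.path.abspath('example.docx'))
```

With requests, the chunks could come from `r.iter_content(1024 * 1024)` on a response opened with `stream=True`, for example `write_chunks(r.iter_content(1024 * 1024), 'C:/docs', doc_name)`, where `'C:/docs'` is only a placeholder folder.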

A few days ago I wrote something similar to the answer linked by @DeepSpace:

import requests

def save_link(book_link, book_name):
    # stream=True defers the body download so it can be written in chunks
    the_book = requests.get(book_link, stream=True)
    with open(book_name, 'wb') as f:
        for chunk in the_book.iter_content(1024 * 1024 * 2):  # 2 MB chunks
            f.write(chunk)

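book_name is derived from the text of book_link in another function, but you could also do this:

  1. Check whether the response headers include a file name.

  2. If not, use the end of the URL as the file name where possible: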

    >>> from urllib.parse import unquote_plus  # Python 3; Python 2 has urllib.unquote_plus
    >>> the_link = 'http://example.com/some_path/Special%20document.doc'
    >>> filename = unquote_plus(the_link.split('/')[-1])
    >>> print(filename)
    Special document.doc
    >>> # then do
    ... with open(filename, 'wb') as f:
    ...     # etc.
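
The two steps above can be sketched as one helper. This is a minimal sketch, not a definitive implementation: `filename_from_response` is a hypothetical name, `headers` is assumed to be a mapping such as `r.headers` from requests, and the standard library's `email.message.Message` is used here only as a convenient parser for the `Content-Disposition` value.

```python
import posixpath
from email.message import Message
from urllib.parse import unquote, urlsplit


def filename_from_response(url, headers):
    """Pick a file name from a Content-Disposition header, else from the URL."""
    disposition = headers.get('Content-Disposition')
    if disposition:
        msg = Message()
        msg['Content-Disposition'] = disposition
        name = msg.get_filename()
        if name:
            # Drop any directory part the server may have included.
            return posixpath.basename(name.replace('\\', '/'))
    # Fall back to the last path segment of the URL, percent-decoded.
    return unquote(posixpath.basename(urlsplit(url).path))


print(filename_from_response(
    'http://example.com/some_path/Special%20document.doc', {}))
```

The fallback uses `unquote` rather than `unquote_plus` because a `+` in a URL path is a literal plus sign; either works for the `%20` case shown above.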
    

Answer 1: (score: -1)

Try this code and see if it works for you:

from urllib.request import Request, urlopen

def get_html(url, timeout = 15):
    ''' function returns the opened response for url (or '' on failure)
    usually urlopen(url) is enough, but some servers reject requests
    without a browser-like User-Agent, so a second attempt sets one.
    urllib.request is part of the Python 3 standard library.'''

    html = ''
    try:
        html = urlopen(url, None, timeout)
    except Exception:
        url = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        try:
            html = urlopen(url, None, timeout)
        except Exception:
            pass
    return html

def get_current_path():
    ''' function returns path of folder in which python program is saved'''

    try:
        path = __file__
    except:
        try:
            import sys
            path = sys.argv[0]
        except:
            path = ''
    if path:
        if '\\' in path:
            path = path.replace('\\', '/')
        end = len(path) - path[::-1].find('/')
        path = path[:end]
    return path

def check_if_name_already_exists(name, path, extension):
    ''' function checks if there is already existing file
    with same name in folder given by path.'''

    try:
        file = open(path + name + extension, 'r')
        file.close()
        return True
    except:
        return False

def get_new_name(old_name, path, extension):
    ''' function asks user to enter new name for file and returns inputted name.'''

    print('File with name "{}" already exists.'.format(old_name))
    prompt = 'Would you like to replace it (answer with "r")\nor create new one (answer with "n") ? '
    answer = input(prompt)
    # compare against a tuple: '' and 'rR' are both substrings of 'rRnN'
    while answer not in ('r', 'R', 'n', 'N'):
        print('Your answer is inconclusive')
        print('Please answer again:')
        print('if you would like to replace the existing file answer with "r"')
        print('if you would like to create new one answer with "n"')
        answer = input(prompt)
    if answer in ('n', 'N'):
        new_name = input('Enter new name for file: ')
        if check_if_name_already_exists(new_name, path, extension):
            # pass extension along; the original call omitted it
            return get_new_name(new_name, path, extension)
        else:
            return new_name
    return old_name

def get_url_extension(url):
    ''' function returns '.doc' or '.docx' if url ends with it, else None'''
    if url.endswith('.docx'):
        return '.docx'
    if url.endswith('.doc'):
        return '.doc'
    return None

def download_word(url, name = 'document', path = None):
    '''function downloads word file from its url
    required argument is url of word file and
    optional argument is name for saved word file and
    optional argument path if you want to choose where your file is saved
    variable path must look like:
        'C:\\Users\\Computer name\\Desktop' or
        'C:/Users/Computer name/Desktop' '''
    # and not like
    #   'C:\Users\Computer name\Desktop'

    word = get_html(url)
    if word == '':
        # get_html returns '' when both download attempts fail
        print('Could not download "{}".'.format(url))
        return
    # fall back to no extension when url ends in neither .doc nor .docx
    extension = get_url_extension(url) or ''

    if extension:
        name = name.replace(extension, '')
    if path is None:
        path = get_current_path()
    if '\\' in path:
        path = path.replace('\\', '/')
    # path is '' when the script location is unknown, so check before indexing
    if path and path[-1] != '/':
        path += '/'
    if path:
        check = check_if_name_already_exists(name, path, extension)
        if check:
            if name == 'document':
                i = 1
                name = 'document(' + str(i) + ')'
                while check_if_name_already_exists(name, path, extension):
                    i += 1
                    name = 'document(' + str(i) + ')'
            else:
                name = get_new_name(name, path, extension)
        file = open(path+name + extension, 'wb')
    else:
        file = open(name + extension, 'wb')

    file.write(word.read())
    file.close()
    if path:
        print(name + extension + ' file downloaded in folder "{}".'.format(path))
    else:
        print(name + extension + ' file downloaded.')
    return


# two sample URLs; the second assignment overwrites the first, so only it is downloaded
download_url = 'http://www.scripps.edu/library/open/instruction/googletips.doc'
download_url = 'http://regionblekinge.se/a/uploads/dokument/demo.docx'
download_word(download_url)
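
To run this kind of download over a whole list of URLs without prompting for names, something like the following might work. It is a rough sketch under stated assumptions: `plan_downloads` and `download_all` are hypothetical names, duplicate file names are numbered only within the batch (files already on disk are not checked), and plain `urlopen` is used, so a SharePoint site that needs NTLM would need the authenticated `requests.get` call from the question instead.

```python
import os
import posixpath
from urllib.parse import unquote, urlsplit
from urllib.request import urlopen


def plan_downloads(urls, folder):
    """Pair each URL with a target path in folder, numbering duplicate names."""
    used = set()
    plan = []
    for url in urls:
        # Last path segment of the URL, percent-decoded; 'document' if empty.
        name = unquote(posixpath.basename(urlsplit(url).path)) or 'document'
        stem, ext = os.path.splitext(name)
        candidate, i = name, 0
        while candidate.lower() in used:
            i += 1
            candidate = '{}({}){}'.format(stem, i, ext)
        used.add(candidate.lower())
        plan.append((url, os.path.join(folder, candidate)))
    return plan


def download_all(urls, folder, timeout=15):
    """Download every URL in urls into folder; return the paths written."""
    os.makedirs(folder, exist_ok=True)
    written = []
    for url, target in plan_downloads(urls, folder):
        with urlopen(url, timeout=timeout) as response, open(target, 'wb') as f:
            f.write(response.read())
        written.append(target)
    return written
```

Calling `download_all(list_of_urls, 'C:/docs')` would then write one file per URL, where `list_of_urls` and `'C:/docs'` are placeholders for the asker's own link list and folder.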