Removing text from web links, then downloading the files

Date: 2019-01-31 05:54:11

Tags: python excel urllib xlsx

I currently have around 1000 web links to Excel files that I want to download. There is no pattern in the document names, so I scraped all of the web links; some of them look like this:

VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0009/172674/Market_Information_System_Control_daily_trading_day_190127.xlsx
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0008/172673/Market_Information_System_Control_daily_trading_day_190126.xlsx
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0007/172672/Market_Information_System_Control_daily_trading_day_190125.xlsx
VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172595/Market_Information_System_Control_daily_trading_day_190124.xlsx

One of the main problems is that every one of these links has VM300:1 at the start, which is not part of the link. How can I remove this "VM300:1" from the beginning of each link? There are roughly 1000 links, so doing it manually is not feasible.

And even once that is fixed, will my code for downloading the files actually work? It currently does not.

Here is my current code:

import urllib2

urlfiles = ['https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx',
            'https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx',
            'https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx']


urllib2.urlopen(urlfiles)

Any help would be greatly appreciated.

3 answers:

Answer 0 (score: 1)

You can split each URL on whitespace, like this:

>>> urls = [
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0009/172674/Market_Information_System_Control_daily_trading_day_190127.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0008/172673/Market_Information_System_Control_daily_trading_day_190126.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0007/172672/Market_Information_System_Control_daily_trading_day_190125.xlsx",
"VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172595/Market_Information_System_Control_daily_trading_day_190124.xlsx"
]
>>>
>>> urlfiles = [url.split()[1] for url in urls]
>>> print(urlfiles)
['https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx',
 'https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx', 
'https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx', 
'https://www.powerwater.com.au__data/assets/excel_doc/0009/172674/Market_Information_System_Control_daily_trading_day_190127.xlsx', 
'https://www.powerwater.com.au__data/assets/excel_doc/0008/172673/Market_Information_System_Control_daily_trading_day_190126.xlsx', 
'https://www.powerwater.com.au__data/assets/excel_doc/0007/172672/Market_Information_System_Control_daily_trading_day_190125.xlsx', 
'https://www.powerwater.com.au__data/assets/excel_doc/0011/172595/Market_Information_System_Control_daily_trading_day_190124.xlsx']

Apart from that, you need to loop over each URL in urlfiles to open it, like this:

>>> import urllib2
>>>
>>> for url in urlfiles:
...     urllib2.urlopen(url)
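
Note that urlopen only returns a response object; to actually save each workbook to disk you still have to write the bytes out yourself. A minimal sketch, assuming Python 2 with urllib2 and deriving the local filename from the last part of the URL:

import os
import urllib2

for url in urlfiles:
    response = urllib2.urlopen(url)       # open the remote .xlsx
    local_name = os.path.basename(url)    # e.g. Market_Information_..._190130.xlsx
    with open(local_name, 'wb') as f:
        f.write(response.read())          # write the raw bytes to disk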

Answer 1 (score: 1)

If your links really all start with 'VM300:1 ' and you need that removed, you can simply drop the first 8 characters and not bother with regular expressions.

As for downloading all of the files, assuming there are no restrictions based on cookies, sessions, etc., and using Python 3:

import urllib.request
from pathlib import Path

urls = [
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0009/172674/Market_Information_System_Control_daily_trading_day_190127.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0008/172673/Market_Information_System_Control_daily_trading_day_190126.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0007/172672/Market_Information_System_Control_daily_trading_day_190125.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172595/Market_Information_System_Control_daily_trading_day_190124.xlsx"
]

for url in urls:
    # drop the 8-character "VM300:1 " prefix and save under the URL's basename
    urllib.request.urlretrieve(url=url[8:], filename=Path(url).name)
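
Since you mention roughly 1000 links, hard-coding them in a list quickly becomes unwieldy. One possible variation, assuming the scraped lines are saved one per line in a hypothetical links.txt, reads and cleans them in a single pass:

import urllib.request
from pathlib import Path

# links.txt is a hypothetical file holding the scraped output,
# one "VM300:1 https://...xlsx" line per row
with open("links.txt") as f:
    urls = [line.split()[1] for line in f if line.strip()]

for url in urls:
    urllib.request.urlretrieve(url=url, filename=Path(url).name)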

Answer 2 (score: 1)

Did I hear Python without any mention of requests?

from pathlib import Path

import requests

urls = [
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172775/Market_Information_System_Control_daily_trading_day_190130.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0004/172732/Market_Information_System_Control_daily_trading_day_190129.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0010/172675/Market_Information_System_Control_daily_trading_day_190128.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0009/172674/Market_Information_System_Control_daily_trading_day_190127.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0008/172673/Market_Information_System_Control_daily_trading_day_190126.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0007/172672/Market_Information_System_Control_daily_trading_day_190125.xlsx",
    "VM300:1 https://www.powerwater.com.au__data/assets/excel_doc/0011/172595/Market_Information_System_Control_daily_trading_day_190124.xlsx"
]
for url in urls:
    link = url.split()[1]              # keep only the actual URL, dropping the "VM300:1" prefix
    r = requests.get(link)
    with open(Path(link).name, 'wb') as f:
        f.write(r.content)             # save the workbook bytes to disk
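
With roughly 1000 downloads, some requests are bound to fail; a small variation on the loop above (an assumption on my part, not part of the original answer) checks the HTTP status before writing, so an error page never gets saved as a .xlsx file:

for url in urls:
    link = url.split()[1]              # drop the "VM300:1 " prefix
    r = requests.get(link)
    r.raise_for_status()               # raise on 4xx/5xx instead of writing the error body
    with open(Path(link).name, 'wb') as f:
        f.write(r.content)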