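Download a list of images from page URLs

Time: 2019-05-24 05:08:24

Tags: python python-requests

The page URLs are stored in a SQL table and do not include the leading part (https://, amazon.in). How can I download the images from these page URLs, together with their names?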

from urllib.parse import urlparse
records="/Bluetooth-Earphone-Control-Smartphones-Powerful/dp/B07NBQ67BN/ref=sr_1_1?fst=as%3Aoff&qid=1554760894&refinements=p_89%3AA+%26+Y&rnid=3837712031&s=electronics&sr=1-1"
p1 = urlparse(records, 'https://')
netloc = p1.netloc
path = p1.path if p1.netloc else ''
if not netloc.startswith('amazon.in/'):
   netloc = 'https://amazon.in/' +records
   p2 = urlparse('.jpg', netloc, path)
   p3=print(p2.geturl())

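With this code I can add the prefix and get the correct URL for a single string, but when I try it with the list of URLs from the table, it fails with an error saying that str cannot be concatenated. I would also like to download the images into a folder on my system.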

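1 Answer:

Answer 0: (score: 0)

Is the list stored in a single column of the table, separated by |? If so, try iterating over all the records with for r in records.split('|'): and then apply the same logic to r.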
from urllib.parse import urljoin

records = ["/Bluetooth-Earphone-Control-Smartphones-Powerful/dp/"
           "B07NBQ67BN/ref=sr_1_1?fst=as%3Aoff&qid=1554760894&refinements=p_89%3AA+%26+Y&rnid=3837712031&s"
           "=electronics&sr=1-1", "/second_url/dp/sf"]
for record in records:
    # urljoin prepends the scheme and host to a relative path and
    # leaves an already absolute URL unchanged.
    url = urljoin('https://amazon.in/', record)
    # The original urlparse('.jpg', netloc, path) call passed the URL as the
    # scheme argument, so it is dropped here.
    print(url)

The loop above assumes that you already have the records as a Python list.
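The "str cannot be concatenated" error in the question most likely comes from evaluating 'https://amazon.in/' + records when records is a list (or a row tuple from a database cursor) rather than a single string. The sketch below handles both shapes of input: a list of paths, or one string in which the paths are separated by |. The helper name build_urls and the sep parameter are made up for illustration, and the | separator is only the guess made in this answer, not something the question confirms.

```python
from urllib.parse import urljoin

BASE_URL = "https://amazon.in/"  # host taken from the question


def build_urls(records, sep="|"):
    """Return absolute URLs for a list of paths or one sep-separated string."""
    if isinstance(records, str):
        # A single column value such as "/a/dp/1|/b/dp/2" becomes a list first.
        records = records.split(sep)
    # Join each path on its own; BASE_URL + a_list would raise TypeError.
    return [urljoin(BASE_URL, r.strip()) for r in records if r.strip()]


print(build_urls(["/first/dp/A1", "/second_url/dp/sf"]))
print(build_urls("/first/dp/A1|/second_url/dp/sf"))
```

Both calls print the same two absolute URLs. Empty pieces, for example from a trailing separator, are skipped instead of being turned into the bare base URL.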

from urllib.parse import urljoin

records = "/Bluetooth-Earphone-Control-Smartphones-Powerful/dpB07NBQ67BN/ref=sr_1_1?fst=as%3Aoff&qid=1554760894&refinements=p_89%3AA+%26+Y&rnid=3837712031&s"
# Splitting on '/' only makes sense when '/' separates the records and the
# records themselves contain no '/'.
for record in records.split('/'):
    if not record:
        continue  # skip the empty piece produced by the leading '/'
    url = urljoin('https://amazon.in/', record)
    print(url)

The loop above applies if the records are separated by /.
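The question also asks how to save the images into a local folder, which the loops above do not cover. A product page URL is an HTML page, not an image, so one possible approach is to fetch the page, collect the src of its img tags, and save each one. The following is a minimal sketch using only the standard library. All function names, the name argument, and the .jpg file extension are assumptions made for illustration, and it treats every img tag on the page as wanted, which a real page would probably need to narrow down.

```python
import os
import shutil
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ImgSrcParser(HTMLParser):
    """Collect the src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_urls(html, page_url):
    """Return absolute image URLs found in html, resolved against page_url."""
    parser = ImgSrcParser()
    parser.feed(html)
    return [urljoin(page_url, src) for src in parser.sources]


def save_image(image_url, folder, filename):
    """Download image_url into folder/filename and return the saved path."""
    os.makedirs(folder, exist_ok=True)
    target = os.path.join(folder, filename)
    with urlopen(image_url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


def download_page_images(page_url, folder, name):
    """Fetch page_url and save its images as name_0.jpg, name_1.jpg, ..."""
    with urlopen(page_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    saved = []
    for i, image_url in enumerate(extract_image_urls(html, page_url)):
        saved.append(save_image(image_url, folder, f"{name}_{i}.jpg"))
    return saved
```

A call could look like download_page_images("https://amazon.in/second_url/dp/sf", "images", "sf"), using one of the absolute URLs built earlier. Some sites reject requests that lack browser-like headers; in that case the urlopen calls would need a Request object with headers, or requests.get(url).content from the requests library named in the tags.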