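The page URLs are stored in a SQL table and do not include the prefix (https://, amazon.in). How can I download the images from these page URLs and save them with names?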
from urllib.parse import urlparse

records = "/Bluetooth-Earphone-Control-Smartphones-Powerful/dp/B07NBQ67BN/ref=sr_1_1?fst=as%3Aoff&qid=1554760894&refinements=p_89%3AA+%26+Y&rnid=3837712031&s=electronics&sr=1-1"
p1 = urlparse(records, 'https://')
netloc = p1.netloc
path = p1.path if p1.netloc else ''
if not netloc.startswith('amazon.in/'):
    netloc = 'https://amazon.in/' + records
p2 = urlparse('.jpg', netloc, path)
print(p2.geturl())
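With this code I can add the prefix and get the correct URL, but when I try it with the list of URLs from the table, it reports that str cannot be concatenated. I also want to download the images into a folder on my system.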
Answer 0 (score: 0)
Is the list in the table column separated by |? If so, try for r in records.split('|'): to iterate over all the records, and then apply the same code to each r.
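The split suggested above can be sketched as follows. This is only a minimal illustration: it assumes the column value is a single string with paths separated by |, it uses urljoin (a swapped-in standard-library helper, not the code from the question) to prepend the host, and the names BASE_URL, build_urls and column_value are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://amazon.in"  # assumed host, taken from the question


def build_urls(column_value, sep="|"):
    """Split a separator-delimited column value and return absolute URLs."""
    urls = []
    for r in column_value.split(sep):
        r = r.strip()
        if r:  # skip empty pieces, e.g. from a trailing separator
            urls.append(urljoin(BASE_URL, r))
    return urls


# Hypothetical column value holding two relative page URLs.
column_value = "/first-item/dp/B07NBQ67BN|/second_url/dp/sf"
for url in build_urls(column_value):
    print(url)
```

urljoin keeps the query string of the relative path, so the long ?fst=... part of the original records is preserved.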
from urllib.parse import urlparse

records = ["/Bluetooth-Earphone-Control-Smartphones-Powerful/dp/"
           "B07NBQ67BN/ref=sr_1_1?fst=as%3Aoff&qid=1554760894&refinements=p_89%3AA+%26+Y&rnid=3837712031&s"
           "=electronics&sr=1-1", "/second_url/dp/sf"]
# Apply the code from the question to each record in the list.
for record in records:
    p1 = urlparse(record, 'https://')
    netloc = p1.netloc
    path = p1.path if p1.netloc else ''
    if not netloc.startswith('amazon.in/'):
        netloc = 'https://amazon.in/' + record
    p2 = urlparse('.jpg', netloc, path)
    print(p2.geturl())
This assumes you have a list of records.
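One possible cause of the concatenation error in the question (an assumption, since the database code is not shown) is that database cursors usually return each row as a tuple, so 'https://amazon.in/' + record fails when record is a row and not a string. The sketch below uses an in-memory sqlite3 database with a hypothetical table pages and column pageurl to show taking the first column of each row before building the URL.

```python
import sqlite3
from urllib.parse import urljoin

BASE_URL = "https://amazon.in"  # assumed host, taken from the question

# Hypothetical table and column names, created in memory for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (pageurl TEXT)")
conn.executemany(
    "INSERT INTO pages (pageurl) VALUES (?)",
    [("/first-item/dp/B07NBQ67BN",), ("/second_url/dp/sf",)],
)

rows = conn.execute("SELECT pageurl FROM pages ORDER BY rowid").fetchall()
# Each row is a tuple such as ('/first-item/dp/B07NBQ67BN',), so take
# column 0 before joining it with the base URL.
urls = [urljoin(BASE_URL, row[0]) for row in rows]
conn.close()

for url in urls:
    print(url)
```

If the real table is in another database, the same row[0] indexing should apply to most DB-API drivers, but that depends on how the cursor is configured.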
from urllib.parse import urlparse

records = "/Bluetooth-Earphone-Control-Smartphones-Powerful/dpB07NBQ67BN/ref=sr_1_1?fst=as%3Aoff&qid=1554760894&refinements=p_89%3AA+%26+Y&rnid=3837712031&s"
# Split the string on '/' and apply the code from the question to each piece.
for record in records.split('/'):
    p1 = urlparse(record, 'https://')
    netloc = p1.netloc
    path = p1.path if p1.netloc else ''
    if not netloc.startswith('amazon.in/'):
        netloc = 'https://amazon.in/' + record
    p2 = urlparse('.jpg', netloc, path)
    print("{}".format(p2.geturl()))  # print() call, as required by Python 3
This applies if the records are separated by /.
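The question also asks about saving images to a local folder, which the code above does not cover. A minimal sketch using only the standard library might look like this. It assumes the URL points directly at an image file; the product pages in the question are HTML pages, so the image URL would first have to be found in the page, which is not shown here. The names filename_for and download, and the example URL, are hypothetical.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_for(url, default="image.jpg"):
    """Derive a local file name from the last segment of the URL path."""
    name = os.path.basename(urlparse(url).path)
    return name or default


def download(url, folder):
    """Save the resource at `url` into `folder` and return the saved path."""
    os.makedirs(folder, exist_ok=True)
    target = os.path.join(folder, filename_for(url))
    with urlopen(url) as response, open(target, "wb") as out:
        out.write(response.read())
    return target


# Hypothetical usage; requires network access and a direct image URL:
# download("https://example.com/pictures/earphone.jpg", "images")
```

Taking the name from the URL path keeps the query string out of the file name; URLs whose path ends in / fall back to the default name.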