http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2
I don't need the trailing /ref=zg_bsms_shoes_2.
I have urls=[]
for productlink in products:
    self.urls.append(productlink)

def save(self):
    self.br.quit()
    f=open(self.product_file,"w")
    for url in self.urls:
        f.write(url+"\n")
    f.flush()
How can I strip it? And is there a fail-safe way that still works when the URL has no /ref= part?
Answer 0: (score: 2)
I strongly recommend starting with urlparse.
In Python 3:
>>> import os
>>> from urllib.parse import urlparse
>>> url = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2'
>>> os.path.split(urlparse(url).path)[0]
'/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
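urlparse splits the URL into its components, after which you can process the path however you like: simple string splitting, os.path.split, a regular expression, and so on.
In Python 2, just use from urlparse import urlparse instead.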
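Note that os.path.split(...)[0] always drops the last path segment, whether or not it is a ref. A minimal sketch of a fail-safe variant built on urlparse follows; the function name strip_ref and the rule "the last segment starts with ref=" are my own assumptions, not part of the answer above.

```python
from urllib.parse import urlparse, urlunparse


def strip_ref(url):
    """Drop a trailing 'ref=...' path segment, if there is one (hypothetical helper)."""
    parts = urlparse(url)
    segments = parts.path.split('/')
    # Only remove the last segment when it actually looks like a ref tag.
    if segments[-1].startswith('ref='):
        segments = segments[:-1]
    # Rebuild the URL with the (possibly shortened) path; other components are untouched.
    return urlunparse(parts._replace(path='/'.join(segments)))


with_ref = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2'
without_ref = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
print(strip_ref(with_ref))
print(strip_ref(without_ref))
```

Both calls should print the same URL without the ref segment, since a URL that has no ref is returned unchanged.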
Answer 1: (score: 1)
>>> x = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2'
>>> '/'.join(x.split('/')[:6])
'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
>>> y = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
>>> '/'.join(y.split('/')[:6])
'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
Answer 2: (score: 1)
if 'ref' in url.split('/')[-1]:  # Failsafe: only strip when the last segment contains 'ref'
    url = '/'.join(url.split('/')[:-1])
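The same check can be wrapped in a function and applied to the whole list before it is saved. This is only a sketch: drop_ref_segment is a hypothetical name, and it tests for a leading ref= rather than 'ref' anywhere in the segment, on the assumption that a product slug could itself contain the letters "ref".

```python
def drop_ref_segment(url):
    """Return url without its last segment when that segment starts with 'ref='."""
    # rpartition splits on the last '/', giving everything before it and the final segment.
    head, _, last = url.rpartition('/')
    if last.startswith('ref='):
        return head
    return url


urls = [
    'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2',
    'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW',
]
# URLs with and without a ref segment end up identical.
cleaned = [drop_ref_segment(u) for u in urls]
print(cleaned)
```

In the question's code this would amount to calling the helper on each productlink before appending it to self.urls.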