Strip part of a URL and save it to a file

Time: 2014-01-04 02:07:44

Tags: python

http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2

I don't need the trailing /ref=zg_bsms_shoes_2

I collect the values in urls = []:
for productlink in products:
    self.urls.append(productlink)

def save(self):
    self.br.quit()
    f=open(self.product_file,"w")
    for url in self.urls:
        f.write(url+"\n")
        f.flush()

How do I strip it? And is there a fail-safe way to do this in case a URL has no /ref= part?

3 answers:

Answer 0 (score: 2)

I strongly recommend starting with urlparse:

In Python 3:

>>> import os
>>> from urllib.parse import urlparse
>>> url = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2'
>>> os.path.split(urlparse(url).path)[0]
'/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'

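urlparse breaks the URL into its components, and you can then process the path in whatever way you like: a simple string split, os.path.split, a regular expression, and so on.

In Python 2, just use from urlparse import urlparse instead.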
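Note that os.path.split removes the last path component unconditionally, so a URL that has no /ref= segment would lose its product ID. One possible way to make the urlparse approach fail-safe is to drop the last segment only when it starts with ref=. This is a minimal sketch assuming Python 3; the helper name strip_ref is made up for illustration and does not come from the answer.

```python
from urllib.parse import urlparse, urlunparse


def strip_ref(url):
    """Drop a trailing 'ref=...' path segment if present (hypothetical helper)."""
    parts = urlparse(url)
    segments = parts.path.split('/')
    # Remove the last segment only when it really is a ref= tracker.
    if segments and segments[-1].startswith('ref='):
        segments = segments[:-1]
    # Rebuild the URL with the (possibly shortened) path.
    return urlunparse(parts._replace(path='/'.join(segments)))
```

Calling strip_ref on a URL that does not end in a ref= segment returns it unchanged, which is the fail-safe behaviour the question asks about.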

Answer 1 (score: 1)

>>> x = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW/ref=zg_bsms_shoes_2'
>>> '/'.join(x.split('/')[:6])
'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
>>> y = 'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'
>>> '/'.join(y.split('/')[:6])
'http://amz.com/New-Balance-WT910-Trail-Running/dp/B0098FOFCW'

Answer 2 (score: 1)

if 'ref' in url.split('/')[-1]:  # Failsafe: strip only when the last segment contains 'ref'
    url = '/'.join(url.split('/')[:-1])
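To connect this back to the save step in the question, the check can be wrapped in a small function and applied as each URL is written out. The sketch below is only one way to do it: clean_url and save_urls are hypothetical names, and it tests for a segment that starts with ref= rather than one that merely contains 'ref', on the assumption that a product slug could itself contain those letters.

```python
def clean_url(url):
    """Remove the last path segment only if it starts with 'ref=' (hypothetical helper)."""
    head, sep, tail = url.rpartition('/')
    # rpartition splits on the last '/', so tail is the final segment.
    return head if sep and tail.startswith('ref=') else url


def save_urls(urls, path):
    """Write one cleaned URL per line; the with-block closes the file for us."""
    with open(path, 'w') as f:
        for url in urls:
            f.write(clean_url(url) + '\n')
```

In the question's class, save_urls(self.urls, self.product_file) would stand in for the body of save() after self.br.quit().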