String items in a list: how do I remove certain keywords?

Time: 2015-09-28 20:54:14

Tags: string python-2.7 get web-scraping strip

I have a list of links that looks like this:

links = ['http://www.website.com/category/subcategory/1',
'http://www.website.com/category/subcategory/2',
'http://www.website.com/category/subcategory/3',...]

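I want to extract 1, 2, 3, and so on from this list and store the extracted values in subcategory_explicit. The links are stored as str, but I cannot access the values with the following code: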

subcategory_explicit = [cat.get('subcategory') for cat in links if cat.get('subcategory') is not None]

Do I have to change the data type from str to something else? What is a better way to get and store the extracted values?

2 answers:

Answer 0: (score: 1)

subcategory_explicit = [i[i.find('subcategory'):] for i in links if 'subcategory' in i]

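This takes a substring by slicing, starting at the "s" in "subcategory" and running to the end of the string. By adding len('subcategory') to the value returned by find, you can exclude "subcategory" itself and get "/#" (where # is any number).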
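The offset idea described above can be sketched as follows. This is only an illustration: it assumes every link contains the literal text 'subcategory' followed by a slash and the number, and the names key, with_keyword and with_slash are made up for this example.

```python
links = [
    'http://www.website.com/category/subcategory/1',
    'http://www.website.com/category/subcategory/2',
    'http://www.website.com/category/subcategory/3',
]

key = 'subcategory'  # illustrative name, not from the question

# Slice from the 's' of 'subcategory' to the end of each link.
with_keyword = [i[i.find(key):] for i in links if key in i]

# Add len(key) to skip the keyword itself, leaving '/1', '/2', ...
with_slash = [i[i.find(key) + len(key):] for i in links if key in i]

# Add one more to skip the slash as well, leaving only the number as a str.
subcategory_explicit = [i[i.find(key) + len(key) + 1:] for i in links if key in i]

print(subcategory_explicit)
```

The results are still strings; wrap each slice in int() if numeric values are needed.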

Answer 1: (score: 1)

Try this (using the re module):

import re

links = [
    'http://www.website.com/category/subcategory/1',
    'http://www.website.com/category/subcategory/2',
    'http://www.website.com/category/subcategory/3']

d = "|".join(links)
# 'http://www.website.com/category/subcategory/1|http://www.website.com/category/subcategory/2|http://www.website.com/category/subcategory/3'

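# Raw string keeps the regex backslashes intact.
# With a single capture group, findall returns the text of that group (the category name).
pattern = re.compile(r"/category/(?P<category_name>\w+)/\d+", re.I)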
subcategory_explicit = pattern.findall(d)

print(subcategory_explicit)
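Because the only capture group in the pattern above wraps the category name, findall returns that name for each link rather than the trailing number. If the numbers are what you want, one possible variation is to capture the digits instead. The sketch below assumes each link ends in /subcategory/ followed by digits; number_pattern is a name invented for this example.

```python
import re

links = [
    'http://www.website.com/category/subcategory/1',
    'http://www.website.com/category/subcategory/2',
    'http://www.website.com/category/subcategory/3',
]

# Capture the digits that follow '/subcategory/' at the end of a link.
number_pattern = re.compile(r"/subcategory/(\d+)$")

subcategory_explicit = []
for link in links:
    match = number_pattern.search(link)
    if match is not None:  # skip links that do not fit the assumed shape
        subcategory_explicit.append(match.group(1))

print(subcategory_explicit)
```

Searching each link separately avoids joining the list into one string, and links that do not match are simply skipped.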