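I am trying to remove multiple strings from a list of URLs. I have more than 300k URLs, and I am trying to find which ones are variations of an original. Here is the toy example I have been working with.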
URLs = ['example.com/page.html',
'www.example.com/in/page.html',
'example.com/ca/fr/page.html',
'm.example.com/de/page.html',
'example.com/fr/page.html']
locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']
What I ultimately want is a list of pages with no language or location markers:
desired_output = ['example.com/page.html',
'example.com/page.html',
'example.com/page.html',
'example.com/page.html',
'example.com/page.html']
I have tried list comprehensions and nested for loops, but nothing has worked so far. Can anyone help?
# doesn't remove anything
for item in URLs:
    for string in locs:
        re.sub(string, '', item)

# doesn't remove anything
for item in URLs:
    for string in locs:
        item.strip(string)

# only removes the last string in locs
clean = []
for item in URLs:
    for string in locs:
        new = item.replace(string, '')
    clean.append(new)
Answer 0 (score: 4)
You have to assign the result of replace back to item:
clean = []
for item in URLs:
    for loc in locs:
        item = item.replace(loc, '')
    clean.append(item)
Or, more compactly:
from functools import reduce  # reduce is not a builtin in Python 3

clean = [
    reduce(lambda s, loc: s.replace(loc, ''), locs, item)
    for item in URLs
]
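As a possible alternative to chaining replace calls, all the substrings can be removed in one pass with a single compiled regular expression. This is a minimal sketch that assumes every entry in locs should be treated as literal text; the name pattern is only illustrative. It also shows why the re.sub attempt in the question needs escaping: unescaped, the '.' in 'm.' matches any character, so 'm.' would also match the 'mp' in 'example'.

```python
import re

URLs = ['example.com/page.html',
        'www.example.com/in/page.html',
        'example.com/ca/fr/page.html',
        'm.example.com/de/page.html',
        'example.com/fr/page.html']
locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']

# re.escape keeps the '.' in 'm.' and 'www.' literal instead of "any character".
pattern = re.compile('|'.join(re.escape(loc) for loc in locs))

# re.sub returns a new string, so the result has to be collected.
clean = [pattern.sub('', url) for url in URLs]
print(clean)
```

With 300k URLs, compiling the pattern once and scanning each URL a single time may be faster than one replace call per entry in locs, but that is worth measuring on the real data.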
Answer 1 (score: 3)
The biggest problem is that you are not saving the return value.
urls = ['example.com/page.html',
'www.example.com/in/page.html',
'example.com/ca/fr/page.html',
'm.example.com/de/page.html',
'example.com/fr/page.html']
locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']
stripped = list(urls)  # work on a copy; not strictly necessary
for loc in locs:
    stripped = [url.replace(loc, '') for url in stripped]
After this, stripped equals
['example.com/page.html',
'example.com/page.html',
'example.com/page.html',
'example.com/page.html',
'example.com/page.html']
Edit:
Alternatively, if you do not need to keep the original list under a separate name, you can rebind urls directly:
for loc in locs:
    urls = [url.replace(loc, '') for url in urls]
After this, urls equals
['example.com/page.html',
'example.com/page.html',
'example.com/page.html',
'example.com/page.html',
'example.com/page.html']
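Since the stated goal in the question is to find variants of an original page, the cleaned string can also serve as a grouping key instead of overwriting the list. The following is only a sketch under that assumption; strip_locs and variants are hypothetical names, not part of the answer above.

```python
from collections import defaultdict

urls = ['example.com/page.html',
        'www.example.com/in/page.html',
        'example.com/ca/fr/page.html',
        'm.example.com/de/page.html',
        'example.com/fr/page.html']
locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']


def strip_locs(url, locs):
    """Return url with every substring in locs removed (hypothetical helper)."""
    for loc in locs:
        url = url.replace(loc, '')  # keep the return value each time
    return url


# Map each cleaned URL to the list of raw URLs that reduce to it.
variants = defaultdict(list)
for url in urls:
    variants[strip_locs(url, locs)].append(url)

for original, group in variants.items():
    print(original, len(group))
```

This keeps the raw URLs available, so each variant can still be traced back to its cleaned form.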
Answer 2 (score: 2)
You can first factor the removal out into a function and then use a list comprehension:
def remove(target, strings):
    for s in strings:
        target = target.replace(s, '')
    return target
URLs = ['example.com/page.html',
'www.example.com/in/page.html',
'example.com/ca/fr/page.html',
'm.example.com/de/page.html',
'example.com/fr/page.html']
locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']
Usage:
URLs = [remove(url,locs) for url in URLs]
for url in URLs: print(url)
Output:
example.com/page.html
example.com/page.html
example.com/page.html
example.com/page.html
example.com/page.html
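One caveat with plain substring removal: replace matches anywhere in the string, so on a URL such as 'www.example.com/in/index.html' the '/in' entry would also eat the start of '/index.html' and leave 'example.comdex.html'. If that matters for the real data, a stricter variant could anchor the matches. The sketch below assumes that entries starting with '/' are whole path segments and all other entries are host prefixes; remove_anchored and the two pattern names are hypothetical.

```python
import re

locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']

# Assumption: '/xx' entries are path segments, everything else is a host prefix.
segments = [loc for loc in locs if loc.startswith('/')]
prefixes = [loc for loc in locs if not loc.startswith('/')]

# Host prefixes are removed only at the very start of the URL.
prefix_re = re.compile('^(?:' + '|'.join(re.escape(p) for p in prefixes) + ')')
# Path segments are removed only when followed by '/' or the end of the string.
segment_re = re.compile('(?:' + '|'.join(re.escape(s) for s in segments) + ')(?=/|$)')


def remove_anchored(url):
    """Strip one leading host prefix and any whole matching path segments."""
    return segment_re.sub('', prefix_re.sub('', url))


print(remove_anchored('www.example.com/in/index.html'))
```

For the five toy URLs this gives the same result as remove, so the extra strictness only changes behavior on inputs where a location code happens to appear inside a longer name.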