Question

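I'm trying to remove multiple strings from a list of URLs. I have over 300k URLs, and I'm trying to find which of them are variations of an original. Here is the toy example I've been working with.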

URLs = ['example.com/page.html',
        'www.example.com/in/page.html',
        'example.com/ca/fr/page.html',
        'm.example.com/de/page.html',
        'example.com/fr/page.html']

locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']

What I ultimately want is a list of pages with no language or location:

desired_output = ['example.com/page.html',
                  'example.com/page.html',
                  'example.com/page.html',
                  'example.com/page.html',
                  'example.com/page.html']

I've tried list comprehensions and nested for loops, but nothing has worked so far. Can anyone help?

# doesn't remove anything
for item in URLs:
    for string in locs:
        re.sub(string, '', item)

# doesn't remove anything
for item in URLs:
    for string in locs:
        item.strip(string)

# only removes the last string in locs
clean = []
for item in URLs:
    for string in locs:
        new = item.replace(string, '')
    clean.append(new)

Answer 1

You have to assign the result of replace back to item:

clean = []
for item in URLs:
    for loc in locs:
        item = item.replace(loc, '')
    clean.append(item)

Or, more concisely:

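from functools import reduce  # reduce is not a builtin in Python 3

clean = [
    # start from item and apply one replace per entry in locs
    reduce(lambda item, loc: item.replace(loc, ''), locs, item)
    for item in URLs
]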
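
The assignment matters because Python strings are immutable: replace never changes the string it is called on, it returns a new one. A small sketch of the difference, reusing one URL from the question:

```python
item = 'www.example.com/in/page.html'

# str.replace returns a new string; the original is left untouched.
item.replace('www.', '')
assert item == 'www.example.com/in/page.html'

# Rebinding the name keeps the result, so successive replacements accumulate.
item = item.replace('www.', '')
item = item.replace('/in', '')
print(item)  # example.com/page.html
```

The same applies to re.sub and strip in the question: both return a new string, and the loops there discard it.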

Answer 2

The biggest problem is that you are not saving the return value.

urls = ['example.com/page.html',
        'www.example.com/in/page.html',
        'example.com/ca/fr/page.html',
        'm.example.com/de/page.html',
        'example.com/fr/page.html']

locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']

stripped = list(urls)  # work on a copy so urls stays unchanged (optional)

for loc in locs:
    stripped = [url.replace(loc, '') for url in stripped]

After this, stripped is equal to

['example.com/page.html',
 'example.com/page.html',
 'example.com/page.html',
 'example.com/page.html',
 'example.com/page.html']

Edit

Alternatively, if you don't need to keep the original list, you can rebind urls directly:

for loc in locs:
    urls = [url.replace(loc, '') for url in urls]

After this, urls is equal to

['example.com/page.html', 'example.com/page.html', 'example.com/page.html', 'example.com/page.html', 'example.com/page.html']
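
The loops above rebuild the whole list once per entry in locs. With 300k URLs it may be worth trying a single pass with one compiled pattern instead. This is only a sketch, assuming the entries of locs are meant as literal substrings; the names pattern and stripped are introduced here for illustration.

```python
import re

urls = ['example.com/page.html',
        'www.example.com/in/page.html',
        'example.com/ca/fr/page.html',
        'm.example.com/de/page.html',
        'example.com/fr/page.html']

locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']

# re.escape matters here: as a raw pattern, 'm.' means "m followed by any
# character", which would also match the 'mp' in 'example'.
pattern = re.compile('|'.join(re.escape(loc) for loc in locs))

# One substitution per URL removes every listed substring in a single scan.
stripped = [pattern.sub('', url) for url in urls]
```

Note that a single scan is not strictly equivalent to repeated replace calls: it will not remove a match that only appears after an earlier removal joins two fragments. For the sample data both approaches give the same result.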

Answer 3

You can first abstract the removal into a function and then use a list comprehension:

def remove(target, strings):
    for s in strings:
        target = target.replace(s,'')
    return target

URLs = ['example.com/page.html',
        'www.example.com/in/page.html',
        'example.com/ca/fr/page.html',
        'm.example.com/de/page.html',
        'example.com/fr/page.html']

locs = ['/in', '/ca', '/de', '/fr', 'm.', 'www.']

Usage:

URLs = [remove(url,locs) for url in URLs]

for url in URLs: print(url)

Output:

example.com/page.html
example.com/page.html
example.com/page.html
example.com/page.html
example.com/page.html
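
One caveat with plain substring removal: '/in' also occurs in a path such as '/index.html', and 'm.' also occurs in a host such as 'example.com.au', so real URLs can be damaged. A possible variant, sketched here under the assumption that the prefixes only count at the start of the URL and the locale codes only count as whole path segments, is shown below. The function normalize and its parameters are hypothetical names, not part of the answer above.

```python
def normalize(url, prefixes=('www.', 'm.'), segments=('in', 'ca', 'de', 'fr')):
    # Remove at most one leading host prefix such as 'www.' or 'm.'.
    for prefix in prefixes:
        if url.startswith(prefix):
            url = url[len(prefix):]
            break
    # Split the host from the path and drop segments that are exactly a locale code.
    host, *parts = url.split('/')
    kept = [part for part in parts if part not in segments]
    return '/'.join([host] + kept)

URLs = ['example.com/page.html',
        'www.example.com/in/page.html',
        'example.com/ca/fr/page.html',
        'm.example.com/de/page.html',
        'example.com/fr/page.html']

cleaned = [normalize(url) for url in URLs]
```

For the sample list this gives the same five entries as before, while a URL like 'example.com.au/index.html' passes through unchanged.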

Python: removing a list of strings from a list of strings
