Question

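I'm new to Python and trying to remove certain elements from a list without knowing the whole string. I'm using a regex to parse domain names (with their TLDs) out of a text document. This works fine, but it also grabs strings that have file extensions (e.g. myfile.exe, which I don't want to include). My function is as follows: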

import re

def find_domains(txt):

    # Regex out domains
    lines = txt.split('\n')
    domains = []

    for line in lines:
        line = line.rstrip()
        # Raw string so the backslashes reach the regex engine unchanged
        results = re.findall(r'([\w\-\.]+(?:\.|\[\.\])+[a-z]{2,6})', line)
        for item in results:
            if item not in domains:
                domains.append(item)

    return domains

This works fine, like I said, but my list ends up looking like:

domains = ['thisisadomain.com', 'anotherdomain.net', 'a_file_I_dont_want.exe', 'another_file_I_dont_want.csv']

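I tried using:

domains.remove(".exe")

But it seems that doesn't work if I don't know the whole string. Is there a way to use a wildcard, or to iterate over the list, to remove unknown elements based only on their extension? Thanks for your help; if more information is needed, I'll try to provide it.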

Answer 1

I would use the built-in str.endswith method. It returns True if the string ends with the specified suffix.

It's easy to use; see the example below. Since Python 2.5, you can also pass it a tuple of suffixes.
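As a quick illustration of the tuple form, here is a small sketch using two of the sample names from the question (the variable names are only for illustration):

```python
# str.endswith accepts a single suffix or a tuple of suffixes.
unwanted = ('.exe', '.csv')

names = ['thisisadomain.com', 'a_file_I_dont_want.exe']
flags = [name.endswith(unwanted) for name in names]
print(flags)  # [False, True]
```

Note that the argument has to be a str or a tuple; passing a list raises a TypeError.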

import re

def find_domains(txt):

    # Regex out domains
    lines = txt.split('\n')
    domains = []
    # Tuple of unwanted extensions; add more if you want.
    unwanted_extensions = ('.exe', '.csv')

    for line in lines:
        line = line.rstrip()
        # Raw string so the backslashes reach the regex engine unchanged
        results = re.findall(r'([\w\-\.]+(?:\.|\[\.\])+[a-z]{2,6})', line)
        for item in results:
            # Keep the item only if it is not already in domains and
            # does not end with any of the unwanted extensions.
            if item not in domains and not item.endswith(unwanted_extensions):
                domains.append(item)

    return domains

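As you can see, you just specify the extensions you don't want (in the unwanted_extensions tuple), then add a condition to the if statement to make sure item doesn't end with any of them.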
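If the list has already been built, the same check can filter it afterwards. list.remove only deletes an element that is equal to its argument, so domains.remove(".exe") raises a ValueError when no element is exactly ".exe". A minimal sketch, assuming the list from the question; the helper name drop_by_extension is made up for this example:

```python
def drop_by_extension(items, extensions):
    """Return a new list without the items ending in any of `extensions`."""
    # str.endswith needs a str or a tuple, so convert other iterables.
    suffixes = tuple(extensions)
    return [item for item in items if not item.endswith(suffixes)]


domains = [
    'thisisadomain.com',
    'anotherdomain.net',
    'a_file_I_dont_want.exe',
    'another_file_I_dont_want.csv',
]
domains = drop_by_extension(domains, ['.exe', '.csv'])
print(domains)  # ['thisisadomain.com', 'anotherdomain.net']
```

Building a new list avoids calling remove while looping over the same list, which can skip elements. The comparison is case-sensitive, so a name ending in ".EXE" would be kept unless you compare against item.lower() instead.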
