Python word matching

Asked: 2017-07-28 04:21:50

Tags: python python-2.7

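I have a list of URLs and I am trying to filter them using specific keywords, say word1 and word2, together with a list of stop words, say [stop1, stop2, stop3]. Is there a way to filter the links without using a lot of if conditions? When I use an if condition for each stop word I get the correct output, but that does not look like a viable option. Here is the brute-force approach: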

for link in url:
   if word1 in link or word2 in link:
      if stop1 not in link:
          if stop2 not in link:
              if stop3 not in link:
                  links.append(link)

3 answers:

Answer 0 (score: 1)

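If I were in your situation, I would consider a couple of options.

You can use a list comprehension with the built-in any and all functions to filter the unwanted URLs out of the list: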

urls = ['http://somewebsite.tld/word',
        'http://somewebsite.tld/word1',
        'http://somewebsite.tld/word1/stop3',
        'http://somewebsite.tld/word2',
        'http://somewebsite.tld/word2/stop2',
        'http://somewebsite.tld/word3',
        'http://somewebsite.tld/stop3/word1',
        'http://somewebsite.tld/stop4/word1']

includes = ['word1', 'word2']
excludes = ['stop1', 'stop2', 'stop3']

filtered_url_list = [url for url in urls if any(include in url for include in includes) if all(exclude not in url for exclude in excludes)]
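The same comprehension can be wrapped in a small function so that the keyword and stop-word lists are passed in rather than fixed. This is only a sketch: the name filter_urls and the shortened sample_urls list are illustrative assumptions, and the matching is plain substring matching, as in the comprehension above.

```python
def filter_urls(urls, includes, excludes):
    """Keep URLs containing at least one include and none of the excludes."""
    return [
        url for url in urls
        if any(word in url for word in includes)
        and not any(stop in url for stop in excludes)
    ]


# Shortened sample list, for illustration only
sample_urls = [
    'http://somewebsite.tld/word1',
    'http://somewebsite.tld/word1/stop3',
    'http://somewebsite.tld/word2/stop2',
    'http://somewebsite.tld/stop4/word1',
]

kept = filter_urls(sample_urls, ['word1', 'word2'], ['stop1', 'stop2', 'stop3'])
print(kept)
# ['http://somewebsite.tld/word1', 'http://somewebsite.tld/stop4/word1']
```

Note that 'stop4' is not in the stop-word list, so the last URL is kept.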

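Alternatively, you can write a function that takes a single URL as its argument and returns True for the URLs you want to keep and False for the ones you don't, then pass that function together with the unfiltered list of URLs to the built-in filter function: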

def urlfilter(url):
    includes = ['word1', 'word2']
    excludes = ['stop1', 'stop2', 'stop3']
    for include in includes:
        if include in url:
            for exclude in excludes:
                if exclude in url:
                    return False
            else:
                # for/else: runs only when no stop word was found
                return True
    # no keyword matched
    return False

urls = ['http://somewebsite.tld/word',
        'http://somewebsite.tld/word1',
        'http://somewebsite.tld/word1/stop3',
        'http://somewebsite.tld/word2',
        'http://somewebsite.tld/word2/stop2',
        'http://somewebsite.tld/word3',
        'http://somewebsite.tld/stop3/word1',
        'http://somewebsite.tld/stop4/word1']

filtered_url_list = filter(urlfilter, urls)
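One caveat if this is run outside Python 2.7: on Python 3, filter returns a lazy iterator rather than a list, so the result usually needs to be wrapped in list(). The sketch below also builds the predicate from arguments instead of hard-coding the word lists. The helper name make_url_filter and the candidates list are assumptions made for illustration.

```python
def make_url_filter(includes, excludes):
    """Build a predicate suitable for filter(); the name is illustrative."""
    def keep(url):
        has_keyword = any(word in url for word in includes)
        has_stop = any(stop in url for stop in excludes)
        return has_keyword and not has_stop
    return keep


keep = make_url_filter(['word1', 'word2'], ['stop1', 'stop2', 'stop3'])
candidates = ['http://somewebsite.tld/word2',
              'http://somewebsite.tld/word3',
              'http://somewebsite.tld/stop3/word1']

# list() gives a real list on both Python 2 and Python 3
filtered = list(filter(keep, candidates))
print(filtered)  # ['http://somewebsite.tld/word2']
```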

Answer 1 (score: 0)

It would help if you could give an example. Suppose we take URLs like the ones in urlList below:
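def urlSearch():
    end_words = ['gmail', 'finance']
    Key_word = ['google']
    urlList = ['google.com//d/gmail', 'google.com/finance', 'google.com/sports', 'google.com/search']
    for i in urlList:
        main_part = i.split('/')
        # only consider URLs whose last path segment is one of end_words
        if main_part[-1] in end_words:
            # collect the dot-separated pieces of everything before the last segment
            word = []
            for k in main_part[:-1]:
                for j in k.split('.'):
                    word.append(j)
            print (word)
            # report the URL when any keyword is among those pieces;
            # keeping this check inside the if avoids reusing a stale word list
            for p in Key_word:
                if p in word:
                    print ("Url is: " + i)

urlSearch()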

Answer 2 (score: -1)

I would use sets and a list comprehension:

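import re
must_in = set([word1, word2])
musnt_in = set([stop1, stop2, stop3])
# set(x) would give the characters of x, so split each URL into tokens first
tokens = dict((x, set(re.split(r'\W+', x))) for x in url)
links = [x for x in url if must_in & tokens[x] and not (musnt_in & tokens[x])]
print(links)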

The code above works for any number of keywords and stop words; it is not limited to two keywords (word1, word2) and three stop words (stop1, stop2, stop3).
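Set intersection compares whole elements, so a set-based filter only works once each URL has been split into tokens, and it then matches whole tokens rather than substrings. Below is a minimal sketch of that idea, assuming tokens are separated by any non-alphanumeric character. The helper name tokens and the sample url_list are illustrative and not part of the answer.

```python
import re


def tokens(url):
    """Split a URL into a set of tokens on any non-alphanumeric character."""
    return set(t for t in re.split(r'[^0-9A-Za-z]+', url) if t)


must_in = {'word1', 'word2'}
must_not_in = {'stop1', 'stop2', 'stop3'}
url_list = ['http://somewebsite.tld/word1',
            'http://somewebsite.tld/word10',
            'http://somewebsite.tld/word2/stop2']

# Keep a URL when it shares a token with must_in and none with must_not_in
links = [u for u in url_list
         if must_in & tokens(u) and must_not_in.isdisjoint(tokens(u))]
print(links)  # ['http://somewebsite.tld/word1']
```

Unlike the substring test 'word1' in url, this does not match 'word10', which may or may not be what the question wants.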