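I have a list of URLs that I'm trying to filter using specific keywords, say word1 and word2, and a list of stop words, say [stop1, stop2, stop3]. Is there a way to filter the links without using a lot of if conditions? Using one if condition per stop word gives me the correct output, but it doesn't look like a viable option. Here is the brute-force approach: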
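for link in url:
    # "word1 or word2 in link" is truthy for any non-empty word1, so test each keyword explicitly
    if word1 in link or word2 in link:
        if stop1 not in link:
            if stop2 not in link:
                if stop3 not in link:
                    links.append(link)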
Answer 0: (score: 1)
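If I were in your situation, I would consider a couple of options.
You can use a list comprehension with the built-in any and all functions to filter the unwanted URLs out of the list: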
urls = ['http://somewebsite.tld/word',
        'http://somewebsite.tld/word1',
        'http://somewebsite.tld/word1/stop3',
        'http://somewebsite.tld/word2',
        'http://somewebsite.tld/word2/stop2',
        'http://somewebsite.tld/word3',
        'http://somewebsite.tld/stop3/word1',
        'http://somewebsite.tld/stop4/word1']
includes = ['word1', 'word2']
excludes = ['stop1', 'stop2', 'stop3']
filtered_url_list = [url for url in urls
                     if any(include in url for include in includes)
                     if all(exclude not in url for exclude in excludes)]
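As a rough sketch, the same comprehension can be wrapped in a small reusable helper and run against the sample list. The name filter_urls is hypothetical and not part of the original answer; the matching is plain substring matching, as in the comprehension above.

```python
def filter_urls(urls, includes, excludes):
    """Keep URLs that contain at least one include and none of the excludes."""
    return [
        url for url in urls
        if any(word in url for word in includes)
        and not any(stop in url for stop in excludes)
    ]


urls = ['http://somewebsite.tld/word',
        'http://somewebsite.tld/word1',
        'http://somewebsite.tld/word1/stop3',
        'http://somewebsite.tld/word2',
        'http://somewebsite.tld/word2/stop2',
        'http://somewebsite.tld/word3',
        'http://somewebsite.tld/stop3/word1',
        'http://somewebsite.tld/stop4/word1']

filtered = filter_urls(urls, ['word1', 'word2'], ['stop1', 'stop2', 'stop3'])
for url in filtered:
    print(url)
# http://somewebsite.tld/word1
# http://somewebsite.tld/word2
# http://somewebsite.tld/stop4/word1
```

The last URL is kept because stop4 is not in the exclude list.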
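Or you can write a function that takes a single URL as its argument and returns True for the URLs you want to keep and False for those you don't, then pass that function, together with the unfiltered list of URLs, to the built-in filter function: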
def urlfilter(url):
    includes = ['word1', 'word2']
    excludes = ['stop1', 'stop2', 'stop3']
    for include in includes:
        if include in url:
            for exclude in excludes:
                if exclude in url:
                    return False
            else:
                # The inner loop finished without finding a stop word
                return True
    # No include keyword matched
    return False
urls = ['http://somewebsite.tld/word',
        'http://somewebsite.tld/word1',
        'http://somewebsite.tld/word1/stop3',
        'http://somewebsite.tld/word2',
        'http://somewebsite.tld/word2/stop2',
        'http://somewebsite.tld/word3',
        'http://somewebsite.tld/stop3/word1',
        'http://somewebsite.tld/stop4/word1']
# In Python 3, filter() returns an iterator, so wrap it in list()
filtered_url_list = list(filter(urlfilter, urls))
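If the keyword and stop-word lists need to vary, one possible variation is to build the predicate from them instead of hard-coding the lists inside the function. This is only a sketch: make_url_filter and keep are hypothetical names, not from the original answer.

```python
def make_url_filter(includes, excludes):
    """Return a predicate suitable for the built-in filter()."""
    def keep(url):
        has_include = any(word in url for word in includes)
        has_exclude = any(stop in url for stop in excludes)
        return has_include and not has_exclude
    return keep


keep = make_url_filter(['word1', 'word2'], ['stop1', 'stop2', 'stop3'])
sample = ['http://somewebsite.tld/word1',
          'http://somewebsite.tld/word2/stop2',
          'http://somewebsite.tld/word3']

# In Python 3, filter() is lazy; list() materialises the result
kept = list(filter(keep, sample))
print(kept)  # ['http://somewebsite.tld/word1']
```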
Answer 1: (score: 0)
It would help if you could give an example. Suppose we take URLs like the ones in urlList below:
def urlSearch():
    end_words = ['gmail', 'finance']
    Key_word = ['google']
    urlList = ['google.com//d/gmail', 'google.com/finance', 'google.com/sports', 'google.com/search']
    for i in urlList:
        main_part = i.split('/')
        # Only consider URLs whose last path segment is one of end_words
        if main_part[-1] in end_words:
            # Collect the dot-separated pieces of every earlier segment
            word = []
            for k in main_part[:-1]:
                for j in k.split('.'):
                    word.append(j)
            print(word)
            for p in Key_word:
                if p in word:
                    print("Url is: " + i)
urlSearch()
Answer 2: (score: -1)
I would use sets and a list comprehension:
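import re
must_in = {word1, word2}
musnt_in = {stop1, stop2, stop3}
# set(x) on a string yields single characters, so split each URL into tokens first
def tokens(x):
    return set(re.split(r'[^A-Za-z0-9]+', x))
links = [x for x in url if must_in & tokens(x) and not (musnt_in & tokens(x))]
print(links)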
The code above works for any number of keywords and stop words; it is not limited to two keywords (word1, word2) and three stop words (stop1, stop2, stop3).
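One caveat with a set-based approach is worth illustrating: calling set() on a string produces a set of single characters, so intersecting it with a set of words cannot match multi-letter keywords. The sketch below assumes the URL is first split into alphanumeric tokens; url_tokens is a hypothetical helper, and token matching is stricter than the substring matching used in the other answers.

```python
import re


def url_tokens(url):
    """Split a URL into its non-empty alphanumeric pieces."""
    return {piece for piece in re.split(r'[^A-Za-z0-9]+', url) if piece}


url = 'http://somewebsite.tld/word10/stop3'

chars = set(url)          # single characters such as 'h', 't', '/'
tokens = url_tokens(url)  # whole pieces such as 'word10' and 'stop3'

char_match = bool({'word1'} & chars)    # False: no character equals 'word1'
token_match = bool({'word1'} & tokens)  # False: the token is 'word10', not 'word1'
substring_match = 'word1' in url        # True: 'word1' is a substring of 'word10'
stop_match = bool({'stop3'} & tokens)   # True: 'stop3' is a whole token

print(char_match, token_match, substring_match, stop_match)
```

Whether whole-token or substring matching is the right choice depends on whether a keyword such as word1 should also match word10.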