Removing elements from a broadcast list

Time: 2018-12-14 06:59:35

Tags: python apache-spark

I have a list of URLs, for example:

www.google.com
www.yahoo.fr
www.stackoverflow.com

I want to remove every URL that contains the string "oo" or "flow".

I wrote a Python function:

def my_function(param1, param2, param3, param4,
                liste_to_delete, liste2_to_delete):
    status = True
    SQL_CONSTANT = "url not like '%"
    URL_SEP = ";"
    # getFirstList
    broadcastListe1String = ""
    listtodelete = liste2_to_delete.split(URL_SEP)
    for url in listtodelete:
        broadcastListe1String = SQL_CONSTANT + url + "%'"
        if listtodelete.index(url) != len(listtodelete) - 1:
            broadcastListe1String = broadcastListe1String + " AND "
    my_broadcast = sc.broadcast(broadcastListe1String)

Then I ran:

DataFrame= my_DataFrame.where(my_broadcast.value)

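The filter only takes effect from the second element of the list onward; the first element is ignored.

How can I change my function so that it also removes the first element of the list? I hope this is clear. Thanks.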

2 Answers:

Answer 0: (score: 1)

I think you can use the filter function like this:

filter(lambda x: 'oo' not in x and 'flow' not in x, lst)

For example:

lst = ['www.google.com',
       'www.yahoo.fr',
       'www.stackoverflow.com',
       'www.duckduck.com',
       'www.amazon.com',
      ]

# In Python 3, filter() returns a lazy iterator, so wrap it in list()
filtered_lst = list(filter(lambda x: 'oo' not in x and 'flow' not in x, lst))
# filtered_lst == ['www.duckduck.com', 'www.amazon.com']

Or:

lst = ['www.google.com',
       'www.yahoo.fr',
       'www.stackoverflow.com',
       'www.duckduck.com',
       'www.amazon.com',
      ]

ex_words = ['oo', 'flow']

# In Python 3, filter() returns a lazy iterator, so wrap it in list()
filtered_lst = list(filter(lambda x: all(w not in x for w in ex_words), lst))
# filtered_lst == ['www.duckduck.com', 'www.amazon.com']
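
The same exclusion list can also be turned into the SQL condition string that the question is trying to build. In the question's loop, each pass assigns a fresh value to broadcastListe1String instead of appending to it, so earlier conditions are lost. Below is a minimal sketch that sidesteps the loop with " AND ".join; the helper name build_where_clause and its column parameter are hypothetical and not part of the original code.

```python
def build_where_clause(patterns, column="url"):
    """Return a SQL condition that excludes rows whose column contains any pattern."""
    # One "not like" condition per pattern. join() puts " AND " only between
    # conditions, so the first and last elements need no special handling.
    conditions = ["{} not like '%{}%'".format(column, p) for p in patterns]
    return " AND ".join(conditions)


# Hypothetical input in the same ";"-separated format as liste2_to_delete.
patterns = "oo;flow".split(";")
where_clause = build_where_clause(patterns)
print(where_clause)
# url not like '%oo%' AND url not like '%flow%'
```

The resulting string should be usable as my_DataFrame.where(where_clause), since DataFrame.where accepts a SQL expression string. The patterns are interpolated directly into the SQL text, so this sketch assumes they come from a trusted source.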

Answer 1: (score: 0)

# Names chosen to avoid shadowing the built-ins filter and list
bad_words = ['oo', 'flow']
urls = ['www.google.com', 'www.yahoo.fr', 'www.stackoverflow.com', 'www.something.com']
for val in urls:
    if not any(bad_word in val for bad_word in bad_words):
        print(val)

Output:

www.something.com
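
If the surviving URLs are needed as a list rather than printed, the same any() check can be wrapped in a small function. This is only a sketch; the function name remove_matching is hypothetical.

```python
def remove_matching(urls, bad_words):
    """Return the URLs that contain none of the substrings in bad_words."""
    return [url for url in urls if not any(word in url for word in bad_words)]


urls = ['www.google.com', 'www.yahoo.fr', 'www.stackoverflow.com', 'www.something.com']
kept = remove_matching(urls, ['oo', 'flow'])
print(kept)
# ['www.something.com']
```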