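I have clickstream data, and I am using the URL column to detect special events. For example, if the URL contains the keyword "Dealer", a new column "Is Dealer" is created that holds a boolean value.
Example df:
Dictionary: I have a dictionary whose keys are domains and whose values are lists of keywords (the keywords that must be looked for in the URL).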
brand_dict = {'volkswagen': ['haendlersuche'], 'mercedes-benz': ['dealer-locator'], 'skoda-auto': ['dealers']}
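I first need to check two conditions on the other columns: if the Domains column equals "BMW" and the URL contains any keyword from that domain's list in the dictionary, the new column gets a boolean value.
The problem is that I have to create 3 columns and I have 3 dictionaries. Is there a better way to do this?
So far, I am doing this: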
def conv_attribution(domain, url):
    # Returns [config match, dealer match, matched brand keyword or "Nan"];
    # the list stays empty when the domain is not in dict_config.
    list_output = []
    if domain in dict_config:
        # True if any config keyword for this domain appears in the URL
        list_output.append(any(keyword in url for keyword in dict_config[domain]))
        # Same check against the dealer keywords (.get avoids a KeyError)
        list_output.append(any(keyword in url for keyword in dict_dealer.get(domain, [])))
        # First brand keyword actually found in the URL, or "Nan" if none matches
        brand_keywords = dict_brand_keywords.get(domain, [])
        list_output.append(next((keyword for keyword in brand_keywords if keyword in url), "Nan"))
    return list_output
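Please help...
Desired output
The desired output looks like the following, except that in "Model Name" I want to put the model name extracted from the URL.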
Answer 0 (score: 0)
Here is a minimal example:
import pandas as pd
domains = ['bmw','smart','smart','fiat','bmw']
urls = ['https://bmw.com/hello','https://smart.com/world','https://smart.com/hello','https://fiat.com/hello','https://bmw.com/hello']
df = pd.DataFrame({'domain':domains,'urls':urls})
# your config dict
brand_dict = {'bmw': ['hello'], 'smart': ['world'],'fiat':['hello']}
Sample df
  domain                     urls
0    bmw    https://bmw.com/hello
1  smart  https://smart.com/world
2  smart  https://smart.com/hello
3   fiat   https://fiat.com/hello
4    bmw    https://bmw.com/hello
Create the new columns
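# Row-wise check: True if any keyword listed for the row's domain occurs in its URL.
# .get(..., []) gives False for domains missing from the dictionary instead of raising KeyError.
df['col_1'] = df.apply(lambda x: any(substring in x.urls for substring in brand_dict.get(x.domain, [])), axis=1)
df['col_2'] = df.apply(lambda x: any(substring in x.urls for substring in brand_dict.get(x.domain, [])), axis=1)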
df
New df
  domain                     urls  col_1  col_2
0    bmw    https://bmw.com/hello   True   True
1  smart  https://smart.com/world   True   True
2  smart  https://smart.com/hello  False  False
3   fiat   https://fiat.com/hello   True   True
4    bmw    https://bmw.com/hello   True   True
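Since the question involves three dictionaries and three columns, the same idea can probably be extended by looping over a mapping from column name to dictionary. The following is only a sketch under assumptions: the sample URLs, the contents of the three dictionaries, the column names (is_config, is_dealer, model_name) and the helper first_match are all made up for illustration and are not taken from the question's real data.

```python
import pandas as pd

# Hypothetical data and dictionaries, for illustration only.
df = pd.DataFrame({
    "domain": ["bmw", "smart", "smart", "fiat", "audi"],
    "urls": [
        "https://bmw.com/dealer/x5",
        "https://smart.com/config",
        "https://smart.com/hello",
        "https://fiat.com/dealer",
        "https://audi.com/dealer",
    ],
})
dict_config = {"bmw": ["config"], "smart": ["config"], "fiat": ["config"]}
dict_dealer = {"bmw": ["dealer"], "smart": ["dealer"], "fiat": ["dealer"]}
dict_brand_keywords = {"bmw": ["x5", "x3"], "smart": ["fortwo"], "fiat": ["500"]}


def first_match(domain, url, keyword_dict):
    """Return the first keyword listed for `domain` that occurs in `url`, else None."""
    return next((kw for kw in keyword_dict.get(domain, []) if kw in url), None)


# One boolean column per dictionary (the column names are assumptions).
bool_columns = {"is_config": dict_config, "is_dealer": dict_dealer}
for column, keyword_dict in bool_columns.items():
    df[column] = [
        first_match(d, u, keyword_dict) is not None
        for d, u in zip(df["domain"], df["urls"])
    ]

# The third column keeps the matched keyword itself rather than a boolean.
df["model_name"] = [
    first_match(d, u, dict_brand_keywords)
    for d, u in zip(df["domain"], df["urls"])
]
print(df)
```

Domains that are missing from a dictionary (audi in this sample) simply get False or an empty model name instead of raising a KeyError. Adding a fourth check would then only need one more entry in bool_columns.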