Question

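I have clickstream data, and I am using the URL column to detect special events. For example, if the URL contains the keyword Dealer, a new column "Is Dealer" is created that holds a boolean value.

Sample df:

Dictionary: I have a dictionary whose keys are domains and whose values are lists of keywords (the keywords that must be checked for in the URL).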

brand_dict = {'volkswagen': ['haendlersuche'], 'mercedes-benz': ['dealer-locator'], 'skoda-auto': ['dealers']}

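I first need to check two conditions across columns: if the Domains column equals, for example, "BMW" and the URL contains any keyword from that domain's list in the dictionary, the new column gets a boolean value.

The problem is that I have to create 3 columns and I have 3 dictionaries. Is there a neater way to do this?

So far I am doing this: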

def conv_attribution(domain, url):
    """Return [config flag, dealer flag, matched brand keyword or "Nan"] for one row."""
    list_output = []

    if domain in dict_config:
        # True if any keyword listed for this domain occurs in the URL
        list_output.append(any(keyword in url for keyword in dict_config[domain]))
        list_output.append(any(keyword in url for keyword in dict_dealer[domain]))

        # Record the first keyword that actually matched, or "Nan" if none did
        matched = next((k for k in dict_brand_keywords[domain] if k in url), "Nan")
        list_output.append(matched)

    return list_output

Please help...

Desired output

The desired output looks like the following, but in "Model Name" I want to add the model name extracted from the URL.

Answer 1

Here is a minimal example

import pandas as pd
domains = ['bmw','smart','smart','fiat','bmw']
urls = ['https://bmw.com/hello','https://smart.com/world','https://smart.com/hello','https://fiat.com/hello','https://bmw.com/hello']
df = pd.DataFrame({'domain':domains,'urls':urls})
# your config dict
brand_dict = {'bmw': ['hello'], 'smart': ['world'],'fiat':['hello']}

Sample df

    domain  urls
0   bmw     https://bmw.com/hello
1   smart   https://smart.com/world
2   smart   https://smart.com/hello
3   fiat    https://fiat.com/hello
4   bmw     https://bmw.com/hello

Create the new columns

df['col_1'] = df.apply(lambda x: any(substring in x.urls for substring in brand_dict[x.domain]) ,axis =1)
df['col_2'] = df.apply(lambda x: any(substring in x.urls for substring in brand_dict[x.domain]) ,axis =1)
df

New df

    domain  urls                     col_1  col_2
0   bmw     https://bmw.com/hello    True   True
1   smart   https://smart.com/world  True   True
2   smart   https://smart.com/hello  False  False
3   fiat    https://fiat.com/hello   True   True
4   bmw     https://bmw.com/hello    True   True
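
The same idea can be extended to several keyword dictionaries without repeating the lambda for each column. What follows is only a sketch under assumptions: `dealer_dict`, `config_dict`, `model_dict`, the column names and the helper `first_match` are hypothetical and not taken from the question. It uses `.get(domain, [])` on the assumption that a row with an unknown domain should simply count as "no match" rather than raise a `KeyError`, and it iterates with `zip` over the two columns instead of `df.apply(..., axis=1)`.

```python
import pandas as pd

df = pd.DataFrame({
    'domain': ['bmw', 'smart', 'smart', 'fiat', 'audi'],
    'urls': ['https://bmw.com/dealer/x5', 'https://smart.com/config',
             'https://smart.com/hello', 'https://fiat.com/dealer/500',
             'https://audi.com/dealer'],
})

# Hypothetical keyword dictionaries, one per column to create
dealer_dict = {'bmw': ['dealer'], 'smart': ['dealer'], 'fiat': ['dealer']}
config_dict = {'bmw': ['config'], 'smart': ['config'], 'fiat': ['config']}
model_dict = {'bmw': ['x5', 'x3'], 'smart': ['fortwo'], 'fiat': ['500']}


def first_match(domain, url, keyword_dict):
    """Return the first keyword of `domain` found in `url`, or None."""
    for keyword in keyword_dict.get(domain, []):
        if keyword in url:
            return keyword
    return None


# Boolean flag columns: one loop instead of one lambda per dictionary
flag_dicts = {'is_dealer': dealer_dict, 'is_config': config_dict}
for col, kw_dict in flag_dicts.items():
    df[col] = [first_match(d, u, kw_dict) is not None
               for d, u in zip(df['domain'], df['urls'])]

# Model name column: keep the matched keyword itself instead of a boolean
df['model_name'] = [first_match(d, u, model_dict)
                    for d, u in zip(df['domain'], df['urls'])]
```

Adding a further column then only means adding one more entry to `flag_dicts`. Rows whose domain is missing from a dictionary (here `audi`) get `False` in the flag columns and a missing value in `model_name`.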

Is there a better or alternative approach besides using the apply function in pandas?

1 answer:
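
One possible alternative, sketched here under assumptions rather than taken from the original thread, is to loop over the domains in the dictionary and let pandas test all URLs of one domain at once with `Series.str.contains`. The helper name `flag_keywords` is hypothetical; the sample data is the one from the minimal example above. Whether this is actually faster than `apply` depends on the data, but it iterates over domains rather than over rows, which may help when there are many rows and few domains.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'domain': ['bmw', 'smart', 'smart', 'fiat', 'bmw'],
    'urls': ['https://bmw.com/hello', 'https://smart.com/world',
             'https://smart.com/hello', 'https://fiat.com/hello',
             'https://bmw.com/hello'],
})
brand_dict = {'bmw': ['hello'], 'smart': ['world'], 'fiat': ['hello']}


def flag_keywords(frame, keyword_dict):
    """Boolean Series: True where the URL contains a keyword of the row's domain."""
    result = pd.Series(False, index=frame.index)
    for domain, keywords in keyword_dict.items():
        if not keywords:
            continue  # an empty pattern would match every URL
        # Escape keywords so characters such as '.' or '-' are matched literally
        pattern = '|'.join(re.escape(k) for k in keywords)
        mask = frame['domain'] == domain
        result.loc[mask] = frame.loc[mask, 'urls'].str.contains(
            pattern, regex=True, na=False)
    return result


df['col_1'] = flag_keywords(df, brand_dict)
```

On the sample data this yields the same `col_1` values as the `apply` version. Domains that do not appear in the dictionary keep the default `False`.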