Question

I have two dataframes, df1 and df2. df2 consists of the columns "tagname" and "value". The dictionary "bucket_dict" holds the data from df2.

bucket_dict = dict(zip(df2.tagname,df2.value))

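df1 has millions of rows and three columns: "apptag", "comments" and "Type". I want to match between these two dataframes: if a key of bucket_dict is contained in df1["apptag"], then set df1["comments"] to the matching dictionary key and df1["Type"] to the corresponding bucket_dict[key]. I use the following code: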

for each_tag in bucket_dict:
    mask = df1["apptag"].str.match(each_tag, case=False, na=False)
    df1.loc[mask, "comments"] = each_tag
    df1.loc[mask, "Type"] = bucket_dict[each_tag]

Is there a more efficient way to do this? The loop above takes a long time.

The bucketing dataframe from which the dictionary is created:

import pandas as pd
bucketing_df = pd.DataFrame([["pen", "study"], ["pencil", "study"], ["ersr", "study"], ["rice", "grocery"], ["wht", "grocery"]], columns=['tagname', 'value'])

The other dataframe:

output_df = pd.DataFrame([["test123-pen", "pen", " "], ["test234-pencil", "pencil", " "], ["test234-rice", "rice", " "]], columns=['apptag', 'comments', 'type'])

Required output:

Answer 1

You can do this by calling apply on the apptag column together with loc on bucketing_df:

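def find_type(a):
    # Bucket value of the first tagname that occurs as a substring of a.
    try:
        mask = [x in a for x in bucketing_df['tagname']]
        return bucketing_df.loc[mask, 'value'].values[0]
    except (IndexError, TypeError):
        # No tagname matched (IndexError) or a is not a string (TypeError).
        return ""

def find_comments(a):
    # The first tagname itself that occurs as a substring of a.
    try:
        mask = [x in a for x in bucketing_df['tagname']]
        return bucketing_df.loc[mask, 'tagname'].values[0]
    except (IndexError, TypeError):
        return ""


output_df['type'] = output_df['apptag'].apply(find_type)
output_df['comments'] = output_df['apptag'].apply(find_comments)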

Here I had to make them separate functions in order to handle the case where no tagname is found in the apptag.

This gives you output_df:

           apptag comments     type
0     test123-pen      pen    study
1  test234-pencil   pencil    study
2    test234-rice     rice  grocery

This code uses only the existing bucketing_df and output_df provided at the end of the question.
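Since apply runs a Python function once per row, it may still be slow on millions of rows. One possible alternative, offered only as a sketch, is to build a single regular expression from all tag names and let pandas extract the match in one vectorized pass. The assumptions here are mine and not from the question: matching is case-insensitive (mirroring case=False in the question), tag names are unique when lower-cased, the longer tag is preferred when one tag is a prefix of another ("pencil" over "pen"), and the row "test999-none" is a hypothetical addition to show the no-match case.

```python
import re
import pandas as pd

bucketing_df = pd.DataFrame(
    [["pen", "study"], ["pencil", "study"], ["ersr", "study"],
     ["rice", "grocery"], ["wht", "grocery"]],
    columns=["tagname", "value"],
)
# "test999-none" is a hypothetical extra row that contains no known tag.
output_df = pd.DataFrame(
    [["test123-pen", " ", " "], ["test234-pencil", " ", " "],
     ["test234-rice", " ", " "], ["test999-none", " ", " "]],
    columns=["apptag", "comments", "type"],
)

# Lower-cased tag -> bucket value (assumes tags are unique ignoring case).
bucket_dict = dict(zip(bucketing_df["tagname"].str.lower(), bucketing_df["value"]))

# One alternation of all tags, longest first so "pencil" is tried before "pen".
tags = sorted(bucket_dict, key=len, reverse=True)
pattern = "(" + "|".join(re.escape(t) for t in tags) + ")"

# A single vectorized pass extracts the leftmost tag found anywhere in apptag;
# rows without any tag yield NaN.
matched = (
    output_df["apptag"]
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)
    .str.lower()
)

output_df["comments"] = matched.fillna("")
output_df["type"] = matched.map(bucket_dict).fillna("")

print(output_df)
```

The regular expression is compiled once and each apptag is scanned a single time, instead of once per tag as in the loop from the question. Note that Series.str.match only matches at the start of the string, whereas str.extract searches anywhere in it, which is what a "contains" test needs.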

Most efficient way to compare two pandas dataframes and update one dataframe based on a condition
