Question

I am struggling to write code that solves the following problem.

I have two Excel spreadsheets. For example:

DF1 - 1. Master Data
DF2 - 2. Consumer Details

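I need to iterate over the Description column in the consumer details, find the rows that contain a string or substring from the master data sheet, and return the adjacent value. I know this is simple and straightforward, but I have not managed to get it working.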

In Excel I use INDEX/MATCH:

INDEX('Path\[Master Sheet.xlsx]Master List'!$B$2:$B$199,MATCH(TRUE,ISNUMBER(SEARCH('path\[Master Sheet.xlsx]Master List'!$A$2:$A$199,B3)),0))

But I need a Python / pandas solution.

E.g. df1 - Master Sheet -
Store       Category
Nike        Shoes
GAP         Clothing
Addidas     Shoes
Apple       Electronics
Abercrombie Clothing
Hollister   Clothing
Samsung     Electronics
Netflix     Movies

etc.....

df2 - Consumer Sheet-

Date     Description    Amount   Category
01/01/20  GAP Stores    1.1     
01/01/20  Apple Limited 1000
01/01/20  Aber fajdfal  50
01/01/20  hollister das 20
01/01/20  NETFLIX.COM   10  
01/01/20  GAP Kids      5.6

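Now I need to update the Category column in the consumer sheet by matching the Description column (string/substring) of the consumer sheet against the Store column of the master sheet.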

Any input or suggestions would be much appreciated.

Answer 1

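One option is to write a custom function that loops over the df1 values and tries to match each store against the string passed in as an argument. If a match is found, it returns the associated category string; if not, it returns None or some other default value. You can use str.lower to increase the chance of finding a match. You then use pandas.Series.apply to apply this function to the column in which you are trying to find matches.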

import pandas as pd

df1 = pd.DataFrame(dict(
    Store = ['Nike','GAP','Addidas','Apple','Abercrombie'],
    Category = ['Shoes','Clothing','Shoes','Electronics','Clothing'],
))

df2 = pd.DataFrame(dict(
    Date = ['01/01/20','01/01/20','01/01/20'],
    Description = ['GAP Stores','Apple Limited','Aber fajdfal'],
    Amount = [1.1,1000,50],
))

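def get_cat(x):
    # Return the category of the first store whose name occurs in x (case-insensitive).
    for store, cat in df1[['Store','Category']].values:
        if store.lower() in x.lower():
            return cat
    return None  # no store matched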

df2['Category'] = df2['Description'].apply(get_cat)

print(df2)

Output:

       Date    Description  Amount     Category
0  01/01/20     GAP Stores     1.1     Clothing
1  01/01/20  Apple Limited  1000.0  Electronics
2  01/01/20   Aber fajdfal    50.0         None
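
As a variation on the loop above, the store names can be combined into one case-insensitive regular expression so each description is scanned only once. This is a minimal sketch, not part of the original answer: the `stores` dict and the name `get_cat_regex` are hypothetical, and the dict stands in for df1 (it could be built with `dict(zip(df1['Store'], df1['Category']))`). Note that it returns the store that appears earliest in the description, which may differ from the loop's "first store in table order" when a description mentions several stores.

```python
import re

# Hypothetical stand-in for df1: store name -> category.
stores = {
    'Nike': 'Shoes',
    'GAP': 'Clothing',
    'Addidas': 'Shoes',
    'Apple': 'Electronics',
    'Abercrombie': 'Clothing',
}

# Longest names first, so a longer name is preferred over a shorter one
# that starts at the same position in the description.
names = sorted(stores, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(n) for n in names), re.IGNORECASE)

# Lower-cased lookup, because the matched text may differ in case.
by_lower = {name.lower(): cat for name, cat in stores.items()}

def get_cat_regex(description):
    """Return the category of the earliest store name found, else None."""
    m = pattern.search(description)
    return by_lower[m.group(0).lower()] if m else None
```

It would be applied the same way as before, for example `df2['Category'] = df2['Description'].apply(get_cat_regex)`.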


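I should note that this solution will not work if 'Aber fajdfal' is supposed to match 'Abercrombie'. You would need to add more complex logic to the function to match partial strings like that.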
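
One possible form of that extra logic is sketched here, under a stated assumption: a word in the description counts as a match if it is a prefix of a store name and is at least `min_prefix` characters long. The prefix rule, the threshold of 4, the `STORE_TO_CATEGORY` dict, and the name `get_cat_partial` are all illustrative choices, not part of the original answer, and a short threshold can produce false positives on real data.

```python
# Hypothetical stand-in for df1; with the DataFrames above it could be
# built as dict(zip(df1['Store'], df1['Category'])).
STORE_TO_CATEGORY = {
    'Nike': 'Shoes',
    'GAP': 'Clothing',
    'Addidas': 'Shoes',
    'Apple': 'Electronics',
    'Abercrombie': 'Clothing',
}

def get_cat_partial(description, lookup=STORE_TO_CATEGORY, min_prefix=4):
    """Return the category of the first store matching the description.

    A store matches if its name is a substring of the description, or if
    some word of the description (at least min_prefix characters long) is
    a prefix of the store name. Comparison is case-insensitive.
    """
    text = description.lower()
    words = text.split()
    for store, cat in lookup.items():
        name = store.lower()
        if name in text:
            return cat  # full store name found inside the description
        if any(len(w) >= min_prefix and name.startswith(w) for w in words):
            return cat  # a description word abbreviates the store name
    return None  # nothing matched

print(get_cat_partial('Aber fajdfal'))  # Clothing
```

With this rule, 'Aber' is accepted as an abbreviation of 'Abercrombie', while descriptions with no recognizable store still return None.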

