Match words in a pandas column and create a new column based on the matches

Asked: 2019-07-21 18:02:45

Tags: pandas python-2.7 csv numpy python-2.6

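I have a pandas DataFrame with several columns and a dictionary whose keys map to lists of values. One column of the df holds a description; I need to look at that description and check whether it matches any of the values in the dictionary's lists.

Here is an excerpt of the dictionary: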

clothing_types = {'T-Shirt': ['t-shirt', 'shirt', 'tee'],
          'Tank Top': ['tank top', 'mesh', 'top', 'tank'],
          'Socks': ['socks'],
          'Hat': ['cap'],
          'Trainers': ['trainers', 'snickers', 'shoes', 'furylite contemporary']}

Here is the column:

0       UNDER ARMOUR LADIES FLY-BY STRETCH MESH TANK TOP
1            UNDER ARMOUR LADIES SPEEDFORM NO SHOW SOCKS
2            UNDER ARMOUR LADIES SPEEDFORM NO SHOW SOCKS
3                     UNDER ARMOUR LADIES PLAY UP SHORTS
4             REEBOK LADIES CLASSIC LEATHER MID TRAINERS
5      UNDER ARMOUR MENS Spring Performance Oxford SHIRT
6       UNDER ARMOUR LADIES HEATGEAR ALPHA SHORTY SHORTS
7                                 ADIDAS LADIES PRO TANK
8                REEBOK LADIES ONE SERIES V NECK T-SHIRT
9                              REEBOK LADIES DF LONG BRA
10                     NIKE LADIES BASELINE TENNIS SKIRT
11              UNDER ARMOUR MENS ESCAPE 7" SOLID SHORTS
12      UNDER ARMOUR LADIES FLY-BY STRETCH MESH TANK TOP

I can do the comparison with a plain for loop:

for item in self.original_file['Product Description'].tolist():
    found = False
    for item_type, type_descriptions in clothing_types.items():
        for description in type_descriptions:
            if description.upper() in item.upper():
                # print(item_type, item)
                found = True
                break

    if not found:
        print('NOT FOUND', item)

I also tried np.where:

for item_type, type_descriptions in clothing_types.items():
    for description in type_descriptions:
        self.original_file['Category'] = np.where(description.upper() in self.original_file['Product Description'], item_type, 'None')

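But each pass overwrites the column with the result of the latest comparison, so the column values always end up as None.

What I expect is that if the description contains, say, "SHIRT", then "T-Shirt" (the dictionary key) appears in a new column, Category.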
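A possible explanation for the np.where result, sketched on two rows from the column above and assuming the default integer index: the `in` operator on a Series tests its index labels rather than its string values, so the condition is a single False and np.where returns the fallback for every row. The variable names below are illustrative only.

```python
import numpy as np
import pandas as pd

# Two descriptions from the question, with the default index 0, 1.
s = pd.Series([
    "ADIDAS LADIES PRO TANK",
    "REEBOK LADIES DF LONG BRA",
])

# `in` on a Series looks at the index labels, not at the stored strings.
label_hit = 0 in s
value_hit = "ADIDAS LADIES PRO TANK" in s

# The condition is one scalar, so np.where yields one result for all rows.
broadcast = np.where("TANK" in s, "Tank Top", "None")

# An element-wise substring test goes through the .str accessor instead.
elementwise = s.str.contains("TANK", regex=False)
```

Here `label_hit` is True while `value_hit` is False, which is why the attempt never sees a match regardless of the description text.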

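3 Answers:

Answer 0 (score: 0)

We can use str.contains to check whether any of a key's values is found. Where there is a match we fill in the dictionary key, otherwise an empty string. Finally, we sum the empty strings and the matches across each row into a single column: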

matches = [np.where(df['Product Description'].str.contains('|'.join(v), case=False), 
                    k, 
                    '') for k, v in clothing_types.items()]

matches_df = pd.DataFrame(matches).T.sum(axis=1).to_frame('Matches')

df = df.join(matches_df)

Output

                                  Product Description   Matches
0    UNDER ARMOUR LADIES FLY-BY STRETCH MESH TANK TOP  Tank Top
1         UNDER ARMOUR LADIES SPEEDFORM NO SHOW SOCKS     Socks
2         UNDER ARMOUR LADIES SPEEDFORM NO SHOW SOCKS     Socks
3                  UNDER ARMOUR LADIES PLAY UP SHORTS          
4          REEBOK LADIES CLASSIC LEATHER MID TRAINERS  Trainers
5   UNDER ARMOUR MENS Spring Performance Oxford SHIRT   T-Shirt
6    UNDER ARMOUR LADIES HEATGEAR ALPHA SHORTY SHORTS          
7                              ADIDAS LADIES PRO TANK  Tank Top
8             REEBOK LADIES ONE SERIES V NECK T-SHIRT   T-Shirt
9                           REEBOK LADIES DF LONG BRA          
10                  NIKE LADIES BASELINE TENNIS SKIRT          
11           UNDER ARMOUR MENS ESCAPE 7" SOLID SHORTS       Hat
12   UNDER ARMOUR LADIES FLY-BY STRETCH MESH TANK TOP  Tank Top
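Row 11 in the output above is labelled Hat because str.contains does substring matching and 'cap' occurs inside "ESCAPE". A minimal sketch of one way to avoid that, assuming whole-word matching is what is wanted: wrap the alternatives in `\b` word boundaries and escape each keyword. The helper `word_pattern`, the reduced dictionary, and the "RUNNING CAP" row are hypothetical and only serve the illustration.

```python
import re

import numpy as np
import pandas as pd

# Reduced dictionary, enough to show the effect.
clothing_types = {
    "T-Shirt": ["t-shirt", "shirt", "tee"],
    "Hat": ["cap"],
}

df = pd.DataFrame({"Product Description": [
    'UNDER ARMOUR MENS ESCAPE 7" SOLID SHORTS',
    "REEBOK LADIES ONE SERIES V NECK T-SHIRT",
    "NIKE MENS RUNNING CAP",  # hypothetical row, not in the original data
]})


def word_pattern(keywords):
    # \b requires a word boundary on both sides; re.escape guards characters
    # such as '-' that could otherwise be read as regex syntax.
    return r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"


# One array per key: the key where a keyword matches, '' elsewhere.
columns = [
    np.where(
        df["Product Description"].str.contains(word_pattern(v), case=False),
        k,
        "",
    )
    for k, v in clothing_types.items()
]

# Concatenate the per-key results row by row.
df["Matches"] = ["".join(parts) for parts in zip(*columns)]
```

The trade-off is that whole-word matching also stops 'shirt' from matching a plural such as "SHIRTS", so the keyword lists may need the plural forms added.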

Answer 1 (score: 0)

This works, but I'm not sure whether it is the best solution:

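for i in self.original_file.index:
    # Read and write by label with .at so the loop also works when the index is not 0..n-1.
    item = self.original_file.at[i, 'Product Description'].upper()
    for item_type, type_descriptions in clothing_types.items():
        for description in type_descriptions:
            if description.upper() in item:
                # No break here: if several keys match, the last matching key is kept.
                self.original_file.at[i, 'Category'] = item_type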
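The loop above visits every row in Python and keeps whichever key matched last. As an alternative sketch (not this answer's method), np.select can assign the first matching key in a vectorized way. It assumes Python 3.7+ so that the dictionary keeps its insertion order, and it uses a plain `df` with a reduced dictionary in place of `self.original_file`.

```python
import re

import numpy as np
import pandas as pd

clothing_types = {
    "T-Shirt": ["t-shirt", "shirt", "tee"],
    "Tank Top": ["tank top", "mesh", "top", "tank"],
    "Socks": ["socks"],
}

df = pd.DataFrame({"Product Description": [
    "UNDER ARMOUR LADIES FLY-BY STRETCH MESH TANK TOP",
    "UNDER ARMOUR LADIES SPEEDFORM NO SHOW SOCKS",
    "UNDER ARMOUR LADIES PLAY UP SHORTS",
]})

descriptions = df["Product Description"]

# One boolean array per key: True where any of that key's keywords occurs.
conditions = [
    descriptions.str.contains(
        "|".join(map(re.escape, keywords)), case=False
    ).to_numpy()
    for keywords in clothing_types.values()
]
choices = list(clothing_types.keys())

# np.select takes the first condition that is True for each row.
df["Category"] = np.select(conditions, choices, default="None")
```

Because np.select resolves ties by position, the order of the keys in the dictionary decides which category wins when a description matches more than one.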

Answer 2 (score: 0)

First, you should swap the keys and values of the clothing types dictionary, like this:

import itertools
clothing_types2 = dict(list(itertools.chain(*[[(y_, x) for y_ in y] for x, y in clothing_types.items()])))


Then, create a function that searches each row for any word that appears in the newly created dictionary:

def to_category(x):
    for w in x.lower().split(" "):
        if w in clothing_types2:
            return clothing_types2[w]
    return None

Finally, apply the function to the column and save the result in a new column:

df["Category"] = df["clothing"].apply(to_category)
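The three steps above can be put together as one runnable sketch. It assumes the column is named 'Product Description' as in the question, and it swaps the itertools expression for an equivalent dictionary comprehension; the name `keyword_to_category` is illustrative.

```python
import pandas as pd

clothing_types = {
    "T-Shirt": ["t-shirt", "shirt", "tee"],
    "Tank Top": ["tank top", "mesh", "top", "tank"],
    "Hat": ["cap"],
}

# Inverted lookup: each keyword points at its category.
keyword_to_category = {
    keyword: category
    for category, keywords in clothing_types.items()
    for keyword in keywords
}


def to_category(description):
    # Compare whole whitespace-separated words against the lookup.
    for word in description.lower().split():
        if word in keyword_to_category:
            return keyword_to_category[word]
    return None


df = pd.DataFrame({"Product Description": [
    "UNDER ARMOUR LADIES FLY-BY STRETCH MESH TANK TOP",
    "REEBOK LADIES ONE SERIES V NECK T-SHIRT",
    'UNDER ARMOUR MENS ESCAPE 7" SOLID SHORTS',
]})

df["Category"] = df["Product Description"].apply(to_category)
```

Since the lookup is done word by word, "ESCAPE" no longer matches 'cap', but multi-word keywords such as 'tank top' can never match on their own; those rows are only caught through single-word keywords like 'tank' or 'top'.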