How to compare columns of two dataframes to add a mapping

Time: 2019-04-10 07:37:51

Tags: python pandas

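I have two dataframes, shown below, and I am trying to set CLASSIFICATION in dataframe1 based on ITEM/CODE in dataframe2. If DESC contains any word that matches an ITEM/CODE value, I need to take the corresponding TYPE from dataframe2.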

To do this, I split the DESC string and tried to compare the resulting list entries with ITEM/CODE from dataframe2. Any ideas on how to do this?

dataframe1
PN      DESC                                        CLASSIFICATION
C23890  Resistor 2.21K elec
C23891  Powerswitch
C23892  Resistor 7.5K
C23893  Resistor .1K
C23894  FET elec
C23895  ELE SD Card adapter
C23896  Crystal 16Mhz
C23897  Capacitor 100uF
C23898  ELECTRONICS Resistor 10K
C23899  M3x5 Socket Cap Bolt MECH
C23900  M3x6 Socket Cap Bolt Mech
C23901  Mehcanical Assemble Kapton Tape 120mm
C23902  MK7 Filament Drive Block Front
C23903  Pulley 5mm shaft

dataframe2
ITEM/CODE      TYPE
ELE         ELECTRONIC
ELECTRONICS ELECTRONIC
Capacitor   ELECTRONIC
Resistor    ELECTRONIC
Washer      MECHANICAL
MECH        MECHANICAL
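
For anyone who wants to try the approaches on this page without the Excel files, here is a minimal sketch that rebuilds the two tables above in memory. The names dataframe1 and dataframe2 follow the table labels; starting CLASSIFICATION as an empty object column is an assumption, since the question does not show how the real file stores it.

```python
import pandas as pd

# Sample rows copied from the dataframe1 table above.
dataframe1 = pd.DataFrame({
    "PN": [f"C{n}" for n in range(23890, 23904)],
    "DESC": [
        "Resistor 2.21K elec",
        "Powerswitch",
        "Resistor 7.5K",
        "Resistor .1K",
        "FET elec",
        "ELE SD Card adapter",
        "Crystal 16Mhz",
        "Capacitor 100uF",
        "ELECTRONICS Resistor 10K",
        "M3x5 Socket Cap Bolt MECH",
        "M3x6 Socket Cap Bolt Mech",
        "Mehcanical Assemble Kapton Tape 120mm",
        "MK7 Filament Drive Block Front",
        "Pulley 5mm shaft",
    ],
})
# Assumption: CLASSIFICATION starts empty (object dtype so strings can be assigned later).
dataframe1["CLASSIFICATION"] = None

# Keyword-to-type table copied from the dataframe2 table above.
dataframe2 = pd.DataFrame({
    "ITEM/CODE": ["ELE", "ELECTRONICS", "Capacitor", "Resistor", "Washer", "MECH"],
    "TYPE": ["ELECTRONIC", "ELECTRONIC", "ELECTRONIC", "ELECTRONIC",
             "MECHANICAL", "MECHANICAL"],
})
```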

This is the code I have written so far.


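import pandas as pd

# Raw strings keep the backslashes in Windows paths from being read as escapes
fn = r'D:\PartsExport.xlsx'
dfInput = pd.read_excel(fn, sheet_name='Sheet1')

fn_type = r'D:\TypeMaster.xlsx'
dfType = pd.read_excel(fn_type, sheet_name='Sheet1')

# Split each description on spaces into a list of words
dfInput['DESC_SPLIT'] = dfInput["DESC"].str.split(" ", n=-1, expand=False)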

Expected result

PN      DESC                      CLASSIFICATION
C23890  Resistor 2.21K elec         ELECTRONIC
C23891  Powerswitch                 ELECTRONIC
C23892  Resistor 7.5K               ELECTRONIC
C23893  Resistor .1K                ELECTRONIC
C23899  M3x5 Socket Cap Bolt MECH   MECHANICAL

2 Answers:

Answer 0 (score: 1)

Loop over a Series created from dataframe2 and use Series.str.contains, with the flags=re.I parameter for case-insensitive matching:

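import re

# Iterate over ITEM/CODE -> TYPE pairs from dataframe2
for k, v in dataframe2.set_index('ITEM/CODE')['TYPE'].items():
    # word boundaries, so that e.g. ELE does not match inside "elec";
    # re.escape guards against regex metacharacters in a keyword
    pat = r"\b{}\b".format(re.escape(k))
    # without word boundaries:
    #pat = re.escape(k)
    dataframe1.loc[dataframe1['DESC'].str.contains(pat, flags=re.I), 'CLASSIFICATION'] = v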

print (dataframe1)
        PN                                   DESC CLASSIFICATION
0   C23890                    Resistor 2.21K elec     ELECTRONIC
1   C23891                            Powerswitch            NaN
2   C23892                          Resistor 7.5K     ELECTRONIC
3   C23893                           Resistor .1K     ELECTRONIC
4   C23894                               FET elec            NaN
5   C23895                    ELE SD Card adapter     ELECTRONIC
6   C23896                          Crystal 16Mhz            NaN
7   C23897                        Capacitor 100uF     ELECTRONIC
8   C23898               ELECTRONICS Resistor 10K     ELECTRONIC
9   C23899              M3x5 Socket Cap Bolt MECH     MECHANICAL
10  C23900              M3x6 Socket Cap Bolt Mech     MECHANICAL
11  C23901  Mehcanical Assemble Kapton Tape 120mm            NaN
12  C23902         MK7 Filament Drive Block Front            NaN
13  C23903                       Pulley 5mm shaft            NaN
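
If the keyword table grows, the same idea can probably be done in one pass instead of one str.contains call per keyword. The following is only a sketch of a different technique (a single alternation regex with Series.str.extract, followed by Series.map), not the loop above rewritten exactly: here the first keyword found in each description wins, whereas in the loop the last matching row of dataframe2 wins. The names lookup, pattern and first_hit are made up for illustration, and the sample rows are a subset of the question's data.

```python
import re

import pandas as pd

dataframe1 = pd.DataFrame({"DESC": [
    "Resistor 2.21K elec",
    "Powerswitch",
    "ELE SD Card adapter",
    "M3x6 Socket Cap Bolt Mech",
]})
dataframe2 = pd.DataFrame({
    "ITEM/CODE": ["ELE", "ELECTRONICS", "Capacitor", "Resistor", "Washer", "MECH"],
    "TYPE": ["ELECTRONIC", "ELECTRONIC", "ELECTRONIC", "ELECTRONIC",
             "MECHANICAL", "MECHANICAL"],
})

# Lower-cased keyword -> TYPE lookup (keywords are assumed unique after lower-casing).
lookup = pd.Series(dataframe2["TYPE"].to_numpy(),
                   index=dataframe2["ITEM/CODE"].str.lower())

# One capture group holding every escaped keyword, wrapped in word boundaries.
pattern = r"\b(" + "|".join(re.escape(k) for k in lookup.index) + r")\b"

# Extract the first keyword found in each description (NaN when none is found),
# then normalise its case and map it to the corresponding TYPE.
first_hit = dataframe1["DESC"].str.extract(pattern, flags=re.I, expand=False)
dataframe1["CLASSIFICATION"] = first_hit.str.lower().map(lookup)

print(dataframe1)
```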

If you only want to match the first word, use Series.map, but first convert both values to lowercase with Series.str.lower:

dataframe2['ITEM/CODE'] = dataframe2['ITEM/CODE'].str.lower()
s = dataframe2.set_index('ITEM/CODE')['TYPE']

dataframe1['CLASSIFICATION'] = dataframe1['DESC'].str.split().str[0].str.lower().map(s)
print (dataframe1)
        PN                                   DESC CLASSIFICATION
0   C23890                    Resistor 2.21K elec     ELECTRONIC
1   C23891                            Powerswitch            NaN
2   C23892                          Resistor 7.5K     ELECTRONIC
3   C23893                           Resistor .1K     ELECTRONIC
4   C23894                               FET elec            NaN
5   C23895                    ELE SD Card adapter     ELECTRONIC
6   C23896                          Crystal 16Mhz            NaN
7   C23897                        Capacitor 100uF     ELECTRONIC
8   C23898               ELECTRONICS Resistor 10K     ELECTRONIC
9   C23899              M3x5 Socket Cap Bolt MECH            NaN
10  C23900              M3x6 Socket Cap Bolt Mech            NaN
11  C23901  Mehcanical Assemble Kapton Tape 120mm            NaN
12  C23902         MK7 Filament Drive Block Front            NaN
13  C23903                       Pulley 5mm shaft            NaN

Answer 1 (score: 1)

Not as fancy, but it should do the job:

import numpy as np
import pandas as pd

# convert dfType to a nested dictionary: {ITEM/CODE: {'TYPE': value}}
type_dict = dfType.set_index('ITEM/CODE').T.to_dict()

# take a DESC value and return the TYPE of its last word found in type_dict
# (exact, case-sensitive match), or NaN if no word matches
def map_type(in_str):
    out_str = np.nan
    for val in in_str.split():
        if val in type_dict:
            out_str = type_dict[val]['TYPE']
    return out_str

# apply the function above to the DESC column
dfInput['CLASSIFICATION'] = dfInput['DESC'].apply(map_type)
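
Note that the dictionary lookup above is case-sensitive, so a description ending in "Mech" would not hit the "MECH" key. A possible case-insensitive variant, sketched here with Series.explode (available from pandas 0.25) instead of apply, lower-cases both sides and keeps the first matching word per row. The names lookup, words and matches are illustrative, and the sample rows are a subset of the question's data.

```python
import pandas as pd

dfInput = pd.DataFrame({"DESC": [
    "Resistor 2.21K elec",
    "FET elec",
    "M3x6 Socket Cap Bolt Mech",
    "ELECTRONICS Resistor 10K",
]})
dfType = pd.DataFrame({
    "ITEM/CODE": ["ELE", "ELECTRONICS", "Capacitor", "Resistor", "Washer", "MECH"],
    "TYPE": ["ELECTRONIC", "ELECTRONIC", "ELECTRONIC", "ELECTRONIC",
             "MECHANICAL", "MECHANICAL"],
})

# Lower-cased keyword -> TYPE dictionary.
lookup = dict(zip(dfType["ITEM/CODE"].str.lower(), dfType["TYPE"]))

# One row per word; explode repeats the original row label for each word.
words = dfInput["DESC"].str.lower().str.split().explode()

# Keep only the words that are known keywords, translated to their TYPE.
matches = words.map(lookup).dropna()

# First match per original row; rows without any match are left as NaN
# because the assignment aligns on the index.
dfInput["CLASSIFICATION"] = matches.groupby(level=0).first()

print(dfInput)
```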