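Vectorized way to check whether DataFrame values (as key, value tuples) are in a dictionary?

Time: 2018-03-15 18:09:43

Tags: python pandas dictionary dataframe boolean

I would like to create a column in my dataframe that checks whether the value in one column is the dictionary value for the key held in another column, like so: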

In [3]:
df = pd.DataFrame({'Model': ['Corolla', 'Civic', 'Accord', 'F-150'],
                   'Make': ['Toyota', 'Honda', 'Toyota', 'Ford']})
dic = {'Prius':'Toyota', 'Corolla':'Toyota', 'Civic':'Honda', 
       'Accord':'Honda', 'Odyssey':'Honda', 'F-150':'Ford', 
       'F-250':'Ford', 'F-350':'Ford'}
df

Out [3]:
     Model    Make
0  Corolla  Toyota
1    Civic   Honda
2   Accord  Toyota
3    F-150    Ford

After applying a function (or whatever else works), I would like to see:

Out [10]:
     Model    Make   match
0  Corolla  Toyota    TRUE
1    Civic   Honda    TRUE
2   Accord  Toyota   FALSE
3    F-150    Ford    TRUE

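Thanks in advance!

EDIT: I tried writing a function that takes a tuple made up of the two columns, but I don't think I am passing the arguments correctly: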

def is_match(make, model):
  try:
    has_item = dic[make] == model
  except KeyError:
    has_item = False
  return(has_item)

df[['Model', 'Make']].apply(is_match)

results in:
TypeError: ("is_match() missing 1 required positional 
argument: 'model'", 'occurred at index Model')
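
A note on the error above: `DataFrame.apply` works column by column by default (`axis=0`), so the function receives one Series as its only argument, and `model` is reported as missing. The following is a minimal row-wise sketch, assuming the intent is to look up each Model in `dic` and compare the result with Make. The name `is_match_row` is made up for illustration and is not from the original post.

```python
import pandas as pd

df = pd.DataFrame({'Model': ['Corolla', 'Civic', 'Accord', 'F-150'],
                   'Make': ['Toyota', 'Honda', 'Toyota', 'Ford']})
dic = {'Prius': 'Toyota', 'Corolla': 'Toyota', 'Civic': 'Honda',
       'Accord': 'Honda', 'Odyssey': 'Honda', 'F-150': 'Ford',
       'F-250': 'Ford', 'F-350': 'Ford'}

def is_match_row(row):
    # dic is keyed by model, so look up the model and compare with the make.
    # dict.get returns None for unknown models, which never equals a make.
    return dic.get(row['Model']) == row['Make']

# axis=1 passes each row (a Series indexed by column name) to the function.
df['match'] = df.apply(is_match_row, axis=1)
match_list = df['match'].tolist()
```

This runs a Python function once per row, so it is a readable fix rather than a vectorized one.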

3 Answers:

Answer 0 (score: 5)

You can use map:

df.assign(match=df.Model.map(dic).eq(df.Make))
Out[129]: 
     Make    Model  match
0  Toyota  Corolla   True
1   Honda    Civic   True
2  Toyota   Accord  False
3    Ford    F-150   True
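
To make the mechanics of the one-liner above easier to follow, here is a small sketch that splits it into two steps. It assumes an extra row (`Tesla` / `Model 3`) that is not in the original data, added only to show what happens when a model is missing from `dic`.

```python
import pandas as pd

dic = {'Prius': 'Toyota', 'Corolla': 'Toyota', 'Civic': 'Honda',
       'Accord': 'Honda', 'Odyssey': 'Honda', 'F-150': 'Ford',
       'F-250': 'Ford', 'F-350': 'Ford'}

# The last row is hypothetical: its model has no entry in dic.
df = pd.DataFrame({'Make': ['Toyota', 'Honda', 'Toyota', 'Ford', 'Tesla'],
                   'Model': ['Corolla', 'Civic', 'Accord', 'F-150', 'Model 3']})

# Step 1: translate each model into the make the dictionary expects.
# Models that are not keys of dic become NaN.
expected_make = df['Model'].map(dic)

# Step 2: element-wise comparison; NaN never equals a string, so
# unknown models end up as False rather than raising an error.
match = expected_make.eq(df['Make'])

result = df.assign(match=match)
```

Because both steps operate on whole columns and refer to them by name, the result does not depend on the order of the columns in the frame.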

Answer 1 (score: 3)

Comprehension


df.assign(match=[dic.get(md, '') == mk for mk, md in df.values])

     Make    Model  match
0  Toyota  Corolla   True
1   Honda    Civic   True
2  Toyota   Accord  False
3    Ford    F-150   True

dict.items and in

items = dic.items()
df.assign(match=[t[::-1] in items for t in map(tuple, df.values)])

     Make    Model  match
0  Toyota  Corolla   True
1   Honda    Civic   True
2  Toyota   Accord  False
3    Ford    F-150   True

isin

df.assign(match=pd.Series(list(map(tuple, df.values[:, ::-1]))).isin(dic.items()))

     Make    Model  match
0  Toyota  Corolla   True
1   Honda    Civic   True
2  Toyota   Accord  False
3    Ford    F-150   True

Numpy Structured Arrays

dtype = [('Make', '<U6'), ('Model', '<U7')]
a = np.array([tuple(r) for r in df.values], dtype)
b = np.array(list(dic.items()), dtype[::-1])

df.assign(match=np.in1d(a, b))

     Make    Model  match
0  Toyota  Corolla   True
1   Honda    Civic   True
2  Toyota   Accord  False
3    Ford    F-150   True

Time comparison

Conclusions

@Wen's method is an order of magnitude better!

Functions

def wen(df, dic):
    return df.assign(match=df.Model.map(dic).eq(df.Make))

def maxu(df, dic):
    return df.assign(match=df[['Make', 'Model']].sum(axis=1).isin(set([v+k for k, v in dic.items()])))

def pir1(df, dic):
    return df.assign(match=[dic.get(md, '') == mk for mk, md in df.values])

def pir2(df, dic):
    items = dic.items()
    return df.assign(match=[t[::-1] in items for t in map(tuple, df.values)])

def pir3(df, dic):
    return df.assign(match=pd.Series(list(map(tuple, df.values[:, ::-1]))).isin(dic.items()))

def pir4(df, dic):
    dtype = [('Make', '<U6'), ('Model', '<U7')]
    a = np.array([tuple(r) for r in df.values], dtype)
    b = np.array(list(dic.items()), dtype[::-1])

    return df.assign(match=np.in1d(a, b))
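
One caveat about the functions that unpack `df.values`: they rely on the columns being ordered Make, Model, as in the outputs shown in this answer. Recent pandas versions keep the insertion order of the dict passed to the constructor, so the frame built in the question comes out as Model, Make. The sketch below assumes that situation and selects the columns by name before unpacking. `pir1_ordered` is a hypothetical variant written for illustration, not part of the original answer.

```python
import pandas as pd

dic = {'Prius': 'Toyota', 'Corolla': 'Toyota', 'Civic': 'Honda',
       'Accord': 'Honda', 'Odyssey': 'Honda', 'F-150': 'Ford',
       'F-250': 'Ford', 'F-350': 'Ford'}

# Built exactly as in the question, with 'Model' listed first.
df = pd.DataFrame({'Model': ['Corolla', 'Civic', 'Accord', 'F-150'],
                   'Make': ['Toyota', 'Honda', 'Toyota', 'Ford']})

def wen(df, dic):
    # Refers to columns by name, so column order does not matter.
    return df.assign(match=df.Model.map(dic).eq(df.Make))

def pir1_ordered(df, dic):
    # Hypothetical variant of pir1: fix the column order explicitly so
    # that each row unpacks as (make, model) regardless of frame layout.
    pairs = df[['Make', 'Model']].values
    return df.assign(match=[dic.get(md, '') == mk for mk, md in pairs])

by_name = wen(df, dic)['match'].tolist()
by_position = pir1_ordered(df, dic)['match'].tolist()
```

Comparing the two lists is a cheap sanity check to run before timing anything, since a benchmark of functions that return different answers is not meaningful.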

Back test

from timeit import timeit

res = pd.DataFrame(
    np.nan, [10, 30, 100, 300, 1000, 3000, 10000, 30000],
    'wen maxu pir1 pir2 pir3 pir4'.split()
)

for i in res.index:
    m = dict(dic.items())
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = f'{j}(d, m)'
        setp = f'from __main__ import {j}, m, d'
        res.at[i, j] = timeit(stmt, setp, number=200)

Results

res.plot(loglog=True)

Answer 2 (score: 2)

Yet another option:

In [38]: df['match'] = df[['Make','Model']] \
                         .sum(axis=1) \
                         .isin(set([v+k for k,v in dic.items()]))

In [39]: df
Out[39]:
     Make    Model  match
0  Toyota  Corolla   True
1   Honda    Civic   True
2  Toyota   Accord  False
3    Ford    F-150   True
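
One thing to keep in mind with string concatenation is that different (Make, Model) pairs can produce the same combined string. The sketch below uses made-up data chosen to trigger such a collision, and compares the concatenation check with a tuple-based check via `pd.MultiIndex.isin`. It is offered as an illustration under those assumptions, not as the answer author's method. The concatenation is written as `df['Make'] + df['Model']`, which is intended to produce the same strings as the row-wise sum above.

```python
import pandas as pd

# Made-up data: 'AB' + 'C' and 'A' + 'BC' both concatenate to 'ABC'.
dic = {'BC': 'A'}  # model -> make, same orientation as in the question
df = pd.DataFrame({'Make': ['AB', 'A'], 'Model': ['C', 'BC']})

# Concatenation-based check: compares only the joined strings.
combined = df['Make'] + df['Model']
concat_match = combined.isin({v + k for k, v in dic.items()})

# Tuple-based check: compares (model, make) pairs as pairs, so the
# boundary between the two fields is preserved.
pairs = pd.MultiIndex.from_frame(df[['Model', 'Make']])
tuple_match = pairs.isin(list(dic.items()))
```

With the car data from the question no such collision occurs, so both checks agree there; adding a separator character that cannot appear in either column is another way to avoid the ambiguity.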