Lookup with missing labels

Asked: 2018-11-21 17:17:36

Tags: pandas lookup missing-data

I have code that uses a DataFrame to look up a value (P) given a column label (X):

import pandas as pd

df_1 = pd.DataFrame({'X': [1,2,3,1,1,2,1,3,2,1]})

df_2 = pd.DataFrame({ 1 : [1,2,3,4,1,2,3,4,1,2],
                      2 : [4,1,2,3,4,1,2,1,2,3],
                      3 : [2,3,4,1,2,3,4,1,2,5]})

# Note: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0
df_1['P'] = df_2.lookup(df_1.index, df_1['X'])

When df_1 contains a label that is not a column of df_2, like this:

df_1 = pd.DataFrame({'X': [7,2,3,1,1,2,1,3,2,1]})

I get:

KeyError: 'One or more column labels was not found'

How can I skip those labels, so that I get:

   X  P
0  7  NaN
1  2  1
2  3  4
3  1  4
4  1  1
5  2  1
6  1  3
7  3  1
8  2  2
9  1  2

5 answers:

Answer 0 (score: 2)

Following the documentation, do the lookup in a loop and add try ... except:

result = []
for row, col in zip(df_1.index, df_1.X):
    try:
        result.append(df_2.loc[row, col])
    except KeyError:
        # label missing from df_2: record NaN instead of failing
        result.append(np.nan)

result
Out[135]: [nan, 1, 4, 4, 1, 1, 3, 1, 2, 2]
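
To get from that list to the column the question asks for, the loop can be wrapped in a small function. This is only a sketch: the name `safe_lookup` is made up for illustration, and it assumes that a missing label surfaces as a `KeyError` from `.loc`, which is what happens with the frames in the question.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'X': [7, 2, 3, 1, 1, 2, 1, 3, 2, 1]})
df_2 = pd.DataFrame({1: [1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
                     2: [4, 1, 2, 3, 4, 1, 2, 1, 2, 3],
                     3: [2, 3, 4, 1, 2, 3, 4, 1, 2, 5]})


def safe_lookup(table, rows, cols):
    """Hypothetical helper: label-based lookup that yields NaN for missing labels."""
    out = []
    for row, col in zip(rows, cols):
        try:
            out.append(table.loc[row, col])
        except KeyError:
            # either the row label or the column label is absent
            out.append(np.nan)
    return out


df_1['P'] = safe_lookup(df_2, df_1.index, df_1['X'])
print(df_1)
```

Catching only `KeyError` keeps unrelated errors, such as a typo in a variable name, from being silently turned into NaN.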

Answer 1 (score: 2)

Use get with a default value

def get_lu(df):
  def lu(i, j):
    return df.get(j, {}).get(i, np.nan)
  return lu

[*map(get_lu(df_2), df_1.index, df_1.X)]

[nan, 1, 4, 4, 1, 1, 3, 1, 2, 2]

Alternatively

[df_2.get(j, {}).get(i, np.nan) for i, j in df_1.X.items()]

[nan, 1, 4, 4, 1, 1, 3, 1, 2, 2]

All together

df_1.assign(P=[df_2.get(j, {}).get(i, np.nan) for i, j in df_1.X.items()])

   X    P
0  7  NaN
1  2  1.0
2  3  4.0
3  1  4.0
4  1  1.0
5  2  1.0
6  1  3.0
7  3  1.0
8  2  2.0
9  1  2.0

An uglier version

df_1.assign(P=[df_2.rename_axis('X', axis=1).stack().get(x, np.nan) for x in df_1.X.items()])

   X    P
0  7  NaN
1  2  1.0
2  3  4.0
3  1  4.0
4  1  1.0
5  2  1.0
6  1  3.0
7  3  1.0
8  2  2.0
9  1  2.0
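
The last variant above re-stacks df_2 for every row of df_1. A possible refinement, offered as a sketch rather than as part of the original answer, is to stack once and let `reindex` fill in NaN for any (row, column) pair that does not exist. It assumes df_2 has unique row and column labels.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'X': [7, 2, 3, 1, 1, 2, 1, 3, 2, 1]})
df_2 = pd.DataFrame({1: [1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
                     2: [4, 1, 2, 3, 4, 1, 2, 1, 2, 3],
                     3: [2, 3, 4, 1, 2, 3, 4, 1, 2, 5]})

# Long form of df_2: a Series indexed by (row label, column label)
stacked = df_2.stack()

# The (row, column) pairs we want, one per row of df_1
pairs = pd.MultiIndex.from_arrays([df_1.index, df_1['X']])

# reindex returns NaN for pairs that are absent from the stacked index
df_1['P'] = stacked.reindex(pairs).to_numpy()
print(df_1)
```

Because the missing pair introduces a NaN, the resulting column is float, as in the outputs above.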

Answer 2 (score: 1)

A bit slower than @piRSquared's answer, but it uses loc + lambda:

>> df_1['P'] = df_1.apply(lambda x: df_2.loc[x.name, x.values[0]] if x.values[0] in df_2.columns else np.nan, axis=1)
>> df_1

    X   P
0   7   NaN
1   2   1.0
2   3   4.0
3   1   4.0
4   1   1.0
5   2   1.0
6   1   3.0
7   3   1.0
8   2   2.0
9   1   2.0

Answer 3 (score: 1)

This answer uses numpy and is fast...

import numpy as np
import pandas as pd

Set up the dataframes

df_1 = pd.DataFrame({'X': [7,2,3,1,1,2,1,3,2,1]})

df_2 = pd.DataFrame({ 1 : [1,2,3,4,1,2,3,4,1,2],
                      2 : [4,1,2,3,4,1,2,1,2,3],
                      3 : [2,3,4,1,2,3,4,1,2,5]})


# designate working columns
lookup_cols = [1, 2, 3]
key_col = 'X'
result_col = 'P'

# get key column values as an array
key = df_1[key_col].values

# make an array of nans to hold the lookup results
result = np.full(key.shape[0], np.nan)

# create a boolean mask that is True where the key is a valid lookup column
b = np.isin(key, lookup_cols)

# filter df_1 and df_2 with boolean array b
df_1b = df_1[b]
df_2b = df_2[b]

# lookup values using filtered dataframes
lup = df_2b.lookup(df_1b.index, df_1b[key_col])
# put the results into the result array at proper index locations using b
result[b] = lup
# assign the result array to the dataframe result column
df_1[result_col] = result
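
The code above relies on `DataFrame.lookup`, which was deprecated in pandas 1.2 and removed in pandas 2.0. On newer versions the same masking idea can be expressed with `Index.get_indexer`, which returns -1 for labels it cannot find. The following is a sketch under the assumption that df_2 has unique row and column labels; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'X': [7, 2, 3, 1, 1, 2, 1, 3, 2, 1]})
df_2 = pd.DataFrame({1: [1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
                     2: [4, 1, 2, 3, 4, 1, 2, 1, 2, 3],
                     3: [2, 3, 4, 1, 2, 3, 4, 1, 2, 5]})

# Translate labels into integer positions; -1 marks a label that was not found
row_pos = df_2.index.get_indexer(df_1.index)
col_pos = df_2.columns.get_indexer(df_1['X'])

# Only positions where both the row and the column exist can be looked up
valid = (row_pos >= 0) & (col_pos >= 0)

# Start from all-NaN and fill in the valid entries with NumPy fancy indexing
result = np.full(len(df_1), np.nan)
result[valid] = df_2.to_numpy()[row_pos[valid], col_pos[valid]]

df_1['P'] = result
print(df_1)
```

The mask plays the same role as `b` in the answer: it keeps the invalid key 7 out of the indexing step, so nothing raises.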

Answer 4 (score: 0)

If I want to use another column of df_1 instead of its index, piRSquared's answer becomes:

df_1 = pd.DataFrame({'M' : ['X','Y','Z','X','Y','F','Y'],
                     'N' : ['A','C','B','B','A','A','F']})

df_2 = pd.DataFrame({'A' : [1,2,3],
                     'B' : [4,1,2],
                     'C' : [2,3,4]},
                     index = ['X', 'Y', 'Z'])

def get_lu(df):
  def lu(i, j):
    return df.get(j, {}).get(i, np.nan)
  return lu

df_1['O'] = [*map(get_lu(df_2), df_1.M, df_1.N)]

Which gives:

   M  N    O
0  X  A  1.0
1  Y  C  3.0
2  Z  B  2.0
3  X  B  4.0
4  Y  A  2.0
5  F  A  NaN
6  Y  F  NaN
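
When both the row key and the column key live in df_1, the lookup can also be phrased as a left join against a long-form copy of df_2. This is a swapped-in technique, not the approach from the answers above, and it is sketched here on the assumption that df_2 has unique labels and that df_1 does not already have an 'O' column.

```python
import pandas as pd

df_1 = pd.DataFrame({'M': ['X', 'Y', 'Z', 'X', 'Y', 'F', 'Y'],
                     'N': ['A', 'C', 'B', 'B', 'A', 'A', 'F']})

df_2 = pd.DataFrame({'A': [1, 2, 3],
                     'B': [4, 1, 2],
                     'C': [2, 3, 4]},
                    index=['X', 'Y', 'Z'])

# Long form of df_2: one row per (M, N) pair with its value in column O
long_form = (df_2.rename_axis('M')
                 .reset_index()
                 .melt(id_vars='M', var_name='N', value_name='O'))

# A left merge keeps every row of df_1, in order; unmatched pairs get NaN
out = df_1.merge(long_form, on=['M', 'N'], how='left')
print(out)
```

Rows 5 and 6 have no match in df_2 (missing row label 'F' and missing column label 'F' respectively), so they come back as NaN.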