Mapping dictionary keys to a pandas df

Time: 2019-11-06 05:15:29

Tags: python pandas dataframe

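I am currently mapping dict keys into a separate column based on the values of another column. I use the values in Code to match values in the dictionary and copy the matching key into a separate column, so the new column will contain the numbers 1, 2, 3.

This works fine except when there are multiple Code values on the same timestamp. I only want one mapped value per unique timestamp.

If there are multiple values on the same timestamp but the mapped number would be the same (A, B), just take the first value. I can use .drop_duplicates for this.

However, if the mapped numbers differ at the same point in time, I want to drop 2 and keep 1. .drop_duplicates only gives that result when the 1 comes before the 2, not the other way around.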

import pandas as pd
from fuzzywuzzy import process

df = pd.DataFrame({   
        'Time' : ['2019-08-02 09:50:10.1','2019-08-02 09:50:10.1','2019-08-02 09:50:10.2','2019-08-02 09:50:10.3','2019-08-02 09:50:10.4','2019-08-02 09:50:10.4','2019-08-02 09:50:10.5','2019-08-02 09:50:10.5','2019-08-02 09:50:10.6','2019-08-02 09:50:10.6'],
        'Code' : ['A','C','X','Y','A','B','X','A','Z','L'],                                   
        })

# Dictionary that contains how to map numbers
hdict = {'1' : ['A', 'B'],

    '2' : ['X','Y','Z'],

    '3' : ['D']}

def hColumn(df):

    # Construct a dataframe from the helper dictionary
    df1 = pd.DataFrame([*hdict.values()], index = hdict.keys()).T.melt().dropna()   
    # Get relevant matches using the library.
    m = df['Code'].apply(lambda x: process.extract(x, df1.value)[0])
    # Concat the matches with original df
    df2 = pd.concat([df, m[m.apply(lambda x: x[1]>80)].apply(lambda x: x[0])], axis=1)
    df2.columns = [*df.columns, 'matches']
    # After merge it with df1
    df2 = df2.merge(df1, left_on='matches', right_on='value', how='left')
    # Drop columns that are not required and rename.
    df2 = df2.drop(columns=['matches','value']).rename(columns={'variable':'H'})
    # Drop unwanted rows
    df2 = df2.mask(df2['H'].isna())
    df2 = df2.dropna(subset = ['H'])

    return df2

df = hColumn(df)

Expected output:

                    Time Code  H
0  2019-08-02 09:50:10.1    A  1
1  2019-08-02 09:50:10.2    X  2
2  2019-08-02 09:50:10.3    Y  2
3  2019-08-02 09:50:10.4    A  1
4  2019-08-02 09:50:10.5    A  1
5  2019-08-02 09:50:10.6    Z  2

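3 answers:

Answer 0 (score: 1)

If I were you, I would "invert" your dictionary so that each code maps to its number, which greatly simplifies everything that follows. Let me know if this helps, and feel free to ask questions.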
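A minimal sketch of the inverted-dictionary idea, not necessarily the code this answer originally contained. The names `code_to_h`, `mapped`, `best_rows` and `result` are hypothetical, and the rule "the smaller number wins within a timestamp" is an assumption read off the expected output in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Time': ['2019-08-02 09:50:10.1', '2019-08-02 09:50:10.1',
             '2019-08-02 09:50:10.2', '2019-08-02 09:50:10.3',
             '2019-08-02 09:50:10.4', '2019-08-02 09:50:10.4',
             '2019-08-02 09:50:10.5', '2019-08-02 09:50:10.5',
             '2019-08-02 09:50:10.6', '2019-08-02 09:50:10.6'],
    'Code': ['A', 'C', 'X', 'Y', 'A', 'B', 'X', 'A', 'Z', 'L'],
})
hdict = {'1': ['A', 'B'], '2': ['X', 'Y', 'Z'], '3': ['D']}

# Invert the dictionary so each code points at its number,
# e.g. {'A': '1', 'B': '1', 'X': '2', ...}.
code_to_h = {code: key for key, codes in hdict.items() for code in codes}

# Exact-match lookup; codes missing from the dictionary become NaN and are dropped.
mapped = df.assign(H=df['Code'].map(code_to_h)).dropna(subset=['H'])

# Assumption: within one timestamp the smallest number wins.
# idxmin returns the label of the first minimal row, so ties keep the first row.
best_rows = mapped['H'].astype(int).groupby(mapped['Time']).idxmin()
result = mapped.loc[best_rows].reset_index(drop=True)
print(result)
```

Note that this uses exact matching with `Series.map` instead of the fuzzy matching in the question, so it only applies if the codes appear in the dictionary verbatim.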


Answer 1 (score: 1)

Use DataFrame.drop_duplicates:

df = df.drop_duplicates('Time')

If possible, here is another solution, which matches the values with Series.map:

#swap key values in dict
#http://stackoverflow.com/a/31674731/2901002
d = {k: oldk for oldk, oldv in hdict.items() for k in oldv}
df["H"] = df['Code'].map(d)
df = df.dropna(subset=['H']).drop_duplicates('Time')
print (df)
                    Time Code  H
0  2019-08-02 09:50:10.1    A  1
2  2019-08-02 09:50:10.2    X  2
3  2019-08-02 09:50:10.3    Y  2
4  2019-08-02 09:50:10.4    A  1
6  2019-08-02 09:50:10.5    X  2
8  2019-08-02 09:50:10.6    Z  2
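In the output above, `drop_duplicates` keeps the first row per timestamp, so 09:50:10.5 ends up as X / 2 rather than the A / 1 shown in the question's expected output. One possible adjustment, sketched here under the assumption that the smaller number should always win, is to sort by `Time` and `H` before dropping duplicates. The name `out` is hypothetical, and comparing `H` as strings only works while the keys are single digits; convert to `int` first otherwise.

```python
import pandas as pd

df = pd.DataFrame({
    'Time': ['2019-08-02 09:50:10.1', '2019-08-02 09:50:10.1',
             '2019-08-02 09:50:10.2', '2019-08-02 09:50:10.3',
             '2019-08-02 09:50:10.4', '2019-08-02 09:50:10.4',
             '2019-08-02 09:50:10.5', '2019-08-02 09:50:10.5',
             '2019-08-02 09:50:10.6', '2019-08-02 09:50:10.6'],
    'Code': ['A', 'C', 'X', 'Y', 'A', 'B', 'X', 'A', 'Z', 'L'],
})
hdict = {'1': ['A', 'B'], '2': ['X', 'Y', 'Z'], '3': ['D']}

# Same inversion as in the answer: code -> number.
d = {k: oldk for oldk, oldv in hdict.items() for k in oldv}
df['H'] = df['Code'].map(d)

out = (df.dropna(subset=['H'])
         .sort_values(['Time', 'H'])   # lowest H first within each timestamp
         .drop_duplicates('Time')      # keeps that first (lowest) row
         .reset_index(drop=True))
print(out)
```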

Answer 2 (score: 1)

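from flashtext import KeywordProcessor

# Register each dictionary key with its list of codes as keywords
kp = KeywordProcessor()
kp.add_keywords_from_dict(hdict)

# extract_keywords returns the matching dictionary keys for each code
df['H'] = df['Code'].apply(lambda x : kp.extract_keywords(x))

# Keep the first match, or None when nothing matched
df['H'] = df['H'].apply(lambda x: x[0] if x else None)
df.dropna(inplace = True)
df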
