Question

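I'm new to pandas and data visualization. I'm working with an OkCupid dataset and want to clean up some of the data. I have an "education" column that takes the following values: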

['graduated from college/university', 'graduated from masters program',
       'working on college/university', 'working on masters program',
       'graduated from two-year college', 'graduated from high school',
       'graduated from ph.d program', 'graduated from law school',
       'working on two-year college', 'dropped out of college/university',
       'working on ph.d program', 'college/university',
       'graduated from space camp', 'dropped out of space camp',
       'graduated from med school', 'working on space camp',
       'working on law school', 'two-year college', 'working on med school',
       'dropped out of two-year college', 'dropped out of masters program',
       'masters program', 'dropped out of ph.d program',
       'dropped out of high school', 'high school', 'working on high school',
       'space camp', 'ph.d program', 'law school', 'dropped out of law school',
       'dropped out of med school', 'med school']

I want to combine them using the following dictionary so that they are easier to plot:

education_cats = {
    'High-school student' : ['dropped out of high school', 'working on high school'],
    'Ungraduated' : ['graduated from high school', 'dropped out of college/university', 'dropped out of space camp', 
                     'dropped out of two-year college', 'high school', 'dropped out of law school','dropped out of med school'],
    'Student' : ['working on college/university', 'working on two-year college', 'working on law school', 'working on med school'],
    'Graduated' : ['graduated from college/university', 'graduated from two-year college', 'graduated from law school',  
                   'college/university', 'graduated from space camp', 'working on space camp', 'graduated from med school', 
                   'two-year college', 'dropped out of masters program', 'space camp', 'law school', 'med school'],
    '2nd-degree student' : ['working on masters program'],
    'Master' : ['graduated from masters program', 'masters program', 'dropped out of ph.d program'],
    '3rd-degree student' : ['working on ph.d program'],
    'P.hd' : ['graduated from ph.d program', 'ph.d program']
}

I have already tried it this way:

import numpy as np

def find_key(value):
    # Return the first category whose list contains the value
    for k, values in education_cats.items():
        if value in values:
            return k
    return np.nan

df['education_category'] = df['education'].map(find_key, na_action='ignore')

Is there a built-in pandas way to do this, or is this the best approach?

Answer 1

Say the list is in a Series column named studies. You can split each value at the first space and then add the pieces to a defaultdict accordingly:

# n is keyword-only in recent pandas versions
l = df.studies.str.split(' ', n=1, expand=True).values.tolist()

from collections import defaultdict
d = defaultdict(list)
for i in l:
    d[i[0]].append(i[1])

print(d)

defaultdict(list,
            {'graduated': ['from college/university',
              'from masters program',
              'from two-year college',
              'from high school',
              'from ph.d program',
              'from law school',
              'from space camp',
              'from med school'],
             'working': ['on college/university',
              'on masters program',
              'on two-year college',
              'on ph.d program',
              'on space camp',
              'on law school',
              'on med school',
              'on high school'],
             'dropped': ['out of college/university',
              'out of space camp',
              'out of two-year college',
              'out of masters program',
              'out of ph.d program',
              'out of high school',
              'out of law school',
              'out of med school'],
             'college/university': [None],
             'two-year': ['college'],
             'masters': ['program'],
             'high': ['school'],
             'space': ['camp'],
             'ph.d': ['program'],
             'law': ['school'],
             'med': ['school']})
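
As the output shows, splitting at the first space puts bare values such as 'two-year college' under the key 'two-year', and 'college/university' ends up with None. One possible variation, offered only as a sketch, is to pull an optional status prefix out with a regular expression instead. The sample Series, the regex pattern, the column names status and institution, and the 'unspecified' label are all assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample of the education values
studies = pd.Series([
    "graduated from college/university",
    "working on masters program",
    "dropped out of high school",
    "two-year college",
    "college/university",
])

# Optional status prefix followed by the institution name
pattern = (
    r"^(?:(?P<status>graduated from|working on|dropped out of) )?"
    r"(?P<institution>.+)$"
)

# Named groups become the columns of the resulting DataFrame
parts = studies.str.extract(pattern)

# Rows without a prefix get an explicit label instead of NaN
parts["status"] = parts["status"].fillna("unspecified")

print(parts)
```

With this shape, the status and the institution can be grouped independently, which may or may not suit the categories you want to plot.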

Answer 2

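It is easier to build a dictionary that uses the individual values as keys, rather than one that maps each category to a list.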

education_cats = {
    'High-school student' : ['dropped out of high school', 'working on high school'],
    'Ungraduated' : ['graduated from high school', 'dropped out of college/university', 'dropped out of space camp', 
                     'dropped out of two-year college', 'high school', 'dropped out of law school','dropped out of med school'],
    'Student' : ['working on college/university', 'working on two-year college', 'working on law school', 'working on med school'],
    'Graduated' : ['graduated from college/university', 'graduated from two-year college', 'graduated from law school',  
                   'college/university', 'graduated from space camp', 'working on space camp', 'graduated from med school', 
                   'two-year college', 'dropped out of masters program', 'space camp', 'law school', 'med school'],
    '2nd-degree student' : ['working on masters program'],
    'Master' : ['graduated from masters program', 'masters program', 'dropped out of ph.d program'],
    '3rd-degree student' : ['working on ph.d program'],
    'P.hd' : ['graduated from ph.d program', 'ph.d program']
}

cats = {}
for cat, l in education_cats.items():
    for item in l:
        cats[item] = cat
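
The same inversion can also be written as a dict comprehension. The sketch below uses a shortened, hypothetical dictionary (education_cats_small) so that it runs on its own, and adds a sanity check that no value is listed under two categories, since a duplicate would silently overwrite an earlier entry.

```python
# Shortened, hypothetical version of the category dictionary
education_cats_small = {
    "Student": ["working on college/university", "working on law school"],
    "Master": ["graduated from masters program", "masters program"],
}

# Invert: each individual value points to its category
cats = {
    item: cat
    for cat, items in education_cats_small.items()
    for item in items
}

# If a value appeared in two lists, the inverted dict would be shorter
n_items = sum(len(items) for items in education_cats_small.values())
if len(cats) != n_items:
    raise ValueError("some value is listed under more than one category")

print(cats["masters program"])
```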

Now you can use either apply or map with a default value:

default_value = 'Unknown'

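# Option 1: look each value up with a fallback
df['education_category'] = df['education'].apply(lambda x: cats.get(x, default_value))

# Option 2: map through the dict, then fill the values that were not found
df['education_category'] = df['education'].map(cats).fillna(default_value)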
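
To see how the map variant behaves on unmapped and missing values, here is a small self-contained sketch. The tiny cats dictionary and the four-row DataFrame are made up for illustration. Note that with this approach a missing education value ends up in the same 'Unknown' bucket as a value that is simply absent from the dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened value -> category lookup
cats = {
    "working on law school": "Student",
    "masters program": "Master",
}

# Made-up rows: two mapped values, one unmapped value, one missing value
df = pd.DataFrame({
    "education": ["working on law school", "masters program", "space camp", np.nan]
})

default_value = "Unknown"

# Values not found in the dict become NaN, which fillna then replaces
df["education_category"] = df["education"].map(cats).fillna(default_value)

# Counts per category, ready to pass to a bar plot
counts = df["education_category"].value_counts()
print(counts)
```

If missing values should stay missing instead, one option is to apply fillna only where the original education value is not null.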

Group categories in pandas
