I have a large DataFrame, shown below. The data used as an example here (education_val.csv) can be found at https://github.com/ENLK/Py-Projects-/blob/master/education_val.csv

import pandas as pd
edu = pd.read_csv('education_val.csv')
del edu['Unnamed: 0']
edu.head(10)
ID Year Education
22445 1991 higher education
29925 1991 No qualifications
76165 1991 No qualifications
223725 1991 Other
280165 1991 intermediate qualifications
333205 1991 No qualifications
387605 1991 higher education
541285 1991 No qualifications
541965 1991 No qualifications
599765 1991 No qualifications
The values in the Education column are:
edu.Education.value_counts()
intermediate qualifications 153705
higher education 67020
No qualifications 55842
Other 36915
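I would like to replace the values in the Education column as follows:

If an ID has the value higher education in the Education column in some year, then all future years for that ID should also have higher education in the Education column.

If an ID has the value intermediate qualifications in some year, then all future years for that ID should have intermediate qualifications in the Education column. However, if the value higher education occurs in any subsequent year for that ID, then higher education replaces intermediate qualifications from that year onwards, regardless of whether Other or No qualifications occur.

For example, in the DataFrame below, ID 22445 has the value higher education in 1991, so all subsequent Education values for 22445 should be replaced with higher education in the later years, up until 2017.

edu.loc[edu['ID'] == 22445]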
ID Year Education
22445 1991 higher education
22445 1992 higher education
22445 1993 higher education
22445 1994 higher education
22445 1995 higher education
22445 1996 intermediate qualifications
22445 1997 intermediate qualifications
22445 1998 Other
22445 1999 No qualifications
22445 2000 intermediate qualifications
22445 2001 intermediate qualifications
22445 2002 intermediate qualifications
22445 2003 intermediate qualifications
22445 2004 intermediate qualifications
22445 2005 intermediate qualifications
22445 2006 intermediate qualifications
22445 2007 intermediate qualifications
22445 2008 intermediate qualifications
22445 2010 intermediate qualifications
22445 2011 intermediate qualifications
22445 2012 intermediate qualifications
22445 2013 intermediate qualifications
22445 2014 intermediate qualifications
22445 2015 intermediate qualifications
22445 2016 intermediate qualifications
22445 2017 intermediate qualifications
Similarly, ID 1587125 in the DataFrame below has the value intermediate qualifications in 1991, which changes to higher education in 1993. All subsequent values in the Education column for 1587125 in future years (from 1993 onwards) should be higher education.

edu.loc[edu['ID'] == 1587125]
ID Year Education
1587125 1991 intermediate qualifications
1587125 1992 intermediate qualifications
1587125 1993 higher education
1587125 1994 higher education
1587125 1995 higher education
1587125 1996 higher education
1587125 1997 higher education
1587125 1998 higher education
1587125 1999 higher education
1587125 2000 higher education
1587125 2001 higher education
1587125 2002 higher education
1587125 2003 higher education
1587125 2004 Other
1587125 2005 No qualifications
1587125 2006 intermediate qualifications
1587125 2007 intermediate qualifications
1587125 2008 intermediate qualifications
1587125 2010 intermediate qualifications
1587125 2011 higher education
1587125 2012 higher education
1587125 2013 higher education
1587125 2014 higher education
1587125 2015 higher education
1587125 2016 higher education
1587125 2017 higher education

There are 12,057 unique IDs in the data, and the Year column ranges from 1991 to 2017. How can I change the Education values for all 12,057 IDs according to the conditions above? I am unsure how to do this in a uniform way for all unique IDs. The sample data used as an example here is available at the GitHub link above. Many thanks in advance.
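To pin down the rule stated above before looking at the answers, here is a minimal plain-Python sketch for a single person's labels, already ordered by year. The function name carry_forward and the sample history are made up for illustration and are not part of the original question.

```python
def carry_forward(labels):
    """Apply the question's rule to one person's labels, ordered by year.

    Hypothetical helper: only 'intermediate qualifications' and
    'higher education' are carried forward; other labels are left as they are.
    """
    seen_intermediate = False
    seen_higher = False
    result = []
    for label in labels:
        # Record which sticky levels have been reached so far
        if label == "higher education":
            seen_higher = True
        elif label == "intermediate qualifications":
            seen_intermediate = True
        # 'higher education' takes priority over 'intermediate qualifications'
        if seen_higher:
            result.append("higher education")
        elif seen_intermediate:
            result.append("intermediate qualifications")
        else:
            result.append(label)
    return result


history = [
    "No qualifications",
    "intermediate qualifications",
    "Other",
    "higher education",
    "No qualifications",
]
print(carry_forward(history))
```

Read literally, the rule does not rank Other against No qualifications, so a history containing only those two labels comes back unchanged.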
Answer 0 (score: 2)
You can use categorical data to do this:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/ENLK/Py-Projects-/master/education_val.csv')
eddtype = pd.CategoricalDtype(['No qualifications',
                               'Other',
                               'intermediate qualifications',
                               'higher education'],
                              ordered=True)
df['EducationCat'] = df['Education'].str.strip().astype(eddtype)
df['EduMax'] = df.sort_values('Year').groupby('ID')['EducationCat']\
                 .transform(lambda x: eddtype.categories[x.cat.codes.cummax()])
It is broken down explicitly so that you can see the data manipulations I am using.
Output:
df[df['ID'] == 1587125]
ID Year Education EducationCat EduMax
18 1587125 1991 intermediate qualifications intermediate qualifications intermediate qualifications
12075 1587125 1992 intermediate qualifications intermediate qualifications intermediate qualifications
24132 1587125 1993 higher education higher education higher education
36189 1587125 1994 higher education higher education higher education
48246 1587125 1995 higher education higher education higher education
60303 1587125 1996 higher education higher education higher education
72360 1587125 1997 higher education higher education higher education
84417 1587125 1998 higher education higher education higher education
96474 1587125 1999 higher education higher education higher education
108531 1587125 2000 higher education higher education higher education
120588 1587125 2001 higher education higher education higher education
132645 1587125 2002 higher education higher education higher education
144702 1587125 2003 higher education higher education higher education
156759 1587125 2004 Other Other higher education
168816 1587125 2005 No qualifications No qualifications higher education
180873 1587125 2006 intermediate qualifications intermediate qualifications higher education
192930 1587125 2007 intermediate qualifications intermediate qualifications higher education
204987 1587125 2008 intermediate qualifications intermediate qualifications higher education
217044 1587125 2010 intermediate qualifications intermediate qualifications higher education
229101 1587125 2011 higher education higher education higher education
241158 1587125 2012 higher education higher education higher education
253215 1587125 2013 higher education higher education higher education
265272 1587125 2014 higher education higher education higher education
277329 1587125 2015 higher education higher education higher education
289386 1587125 2016 higher education higher education higher education
301443 1587125 2017 higher education higher education higher education
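The same idea can be tried on a small frame without downloading the CSV. The sketch below is a variation rather than the answer's exact code: it takes the running maximum of the category codes with groupby(...).cummax() and rebuilds the labels with pd.Categorical.from_codes. The toy data is invented for illustration.

```python
import pandas as pd

levels = ["No qualifications", "Other",
          "intermediate qualifications", "higher education"]
dtype = pd.CategoricalDtype(levels, ordered=True)

# Invented toy data: two people, a few years each
toy = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2],
    "Year": [1991, 1992, 1993, 1994, 1991, 1992, 1993],
    "Education": [
        "intermediate qualifications", "higher education",
        "Other", "No qualifications",
        "Other", "No qualifications", "intermediate qualifications",
    ],
})

# The running maximum only makes sense in chronological order
toy = toy.sort_values(["ID", "Year"])

# Integer position of each label in the ordered category list
codes = toy["Education"].astype(dtype).cat.codes

# Highest code seen so far, computed separately for each ID
running = codes.groupby(toy["ID"]).cummax()

# Turn the codes back into labels
toy["EduMax"] = pd.Categorical.from_codes(running, dtype=dtype)
print(toy)
```

Because the dtype ranks Other above No qualifications, person 2's second year becomes Other. That follows from the ordering chosen here and goes slightly beyond the rule as literally stated in the question.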
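Answer 1 (score: 2)
Education levels are clearly ordered. Your problem can be restated as a "rolling max" problem: what is the highest level of education a person has attained as of a given year?
Try this: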
# A dictionary mapping each label to a rank
mappings = {e: i for i, e in enumerate(['No qualifications', 'Other', 'intermediate qualifications', 'higher education'])}
# Convert the label to its rank
edu['Education'] = edu['Education'].map(mappings)
# The gist of the solution: an expanding max level of education per person
tmp = edu.sort_values('Year').groupby('ID')['Education'].expanding().max()
# The first index level in tmp is the ID, the second level is the original index
# We only need the original index, hence the droplevel
# We also convert the rank back to the label (swapping keys and values in the mappings dictionary)
tmp = tmp.droplevel(0).map({v: k for k, v in mappings.items()})
edu['Education'] = tmp
Test:
edu[edu['ID'] == 1587125]
ID Year Education
1587125 1991 intermediate qualifications
1587125 1992 intermediate qualifications
1587125 1993 higher education
1587125 1994 higher education
1587125 1995 higher education
1587125 1996 higher education
1587125 1997 higher education
1587125 1998 higher education
1587125 1999 higher education
1587125 2000 higher education
1587125 2001 higher education
1587125 2002 higher education
1587125 2003 higher education
1587125 2004 higher education
1587125 2005 higher education
1587125 2006 higher education
1587125 2007 higher education
1587125 2008 higher education
1587125 2010 higher education
1587125 2011 higher education
1587125 2012 higher education
1587125 2013 higher education
1587125 2014 higher education
1587125 2015 higher education
1587125 2016 higher education
1587125 2017 higher education
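Inspecting one ID by eye does not scale to 12,057 of them. As a rough sketch, the check below confirms that no person's rank ever drops from one year to the next. The function name never_decreases and the toy frames are assumptions for illustration, and the ranking is the same one used in the mappings dictionary above.

```python
import pandas as pd

RANK = {
    "No qualifications": 0,
    "Other": 1,
    "intermediate qualifications": 2,
    "higher education": 3,
}


def never_decreases(df):
    """Return True if, within every ID, the education rank never drops over time."""
    ordered = df.sort_values(["ID", "Year"])
    ranks = ordered["Education"].map(RANK)
    # Year-over-year change within each ID; the first year of each ID is NaN
    drops = ranks.groupby(ordered["ID"]).diff() < 0
    return not drops.any()


good = pd.DataFrame({
    "ID": [1, 1, 1],
    "Year": [1991, 1992, 1993],
    "Education": ["Other", "higher education", "higher education"],
})
bad = good.assign(Education=["higher education", "Other", "higher education"])

print(never_decreases(good), never_decreases(bad))
```

This check assumes the full four-level ordering. Under a literal reading of the question, Other followed by No qualifications would be acceptable, so the Other and No qualifications ranks would need to be made equal in that case.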
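Answer 2 (score: 1)
You can loop over the IDs, and then over the years. The DataFrame is in chronological order, so if a person has "higher education" or "intermediate qualifications" in a cell, you can remember that and apply it to the subsequent cells: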
# Assumes edu has its default unique index and rows in chronological order
for _, rows in edu.groupby('ID'):
    # booleans to keep track of education statuses we've seen
    higher_ed = False
    inter_qual = False
    for idx, row in rows.iterrows():
        # check for intermediate qualifications
        if inter_qual:
            edu.at[idx, 'Education'] = 'intermediate qualifications'
        elif row['Education'] == 'intermediate qualifications':
            inter_qual = True
        # check for higher education (done last so that it wins)
        if higher_ed:
            edu.at[idx, 'Education'] = 'higher education'
        elif row['Education'] == 'higher education':
            higher_ed = True
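We may overwrite a status more than once; if a person has both "intermediate qualifications" and "higher education", we just need to make sure that "higher education" is applied last.
I would not normally recommend for loops for working with a DataFrame, but each cell's value may depend on the values above it, and the DataFrame is not too large for this to be feasible.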
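If the loop turns out to be too slow, the same two-level rule can probably be expressed without explicit iteration. The sketch below is an assumption-laden variation, not the answer's method: it ranks only the two sticky labels, takes a per-ID running maximum, and leaves every other label untouched. The toy data and the names sticky and labels are made up.

```python
import pandas as pd

# Only these two labels are carried forward; everything else counts as level 0
sticky = {"intermediate qualifications": 1, "higher education": 2}
labels = {1: "intermediate qualifications", 2: "higher education"}

# Invented toy data for one person
toy = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 1],
    "Year": [1991, 1992, 1993, 1994, 1995, 1996],
    "Education": [
        "Other", "No qualifications", "intermediate qualifications",
        "Other", "higher education", "No qualifications",
    ],
})

toy = toy.sort_values(["ID", "Year"])

# Level of each row: 0 for non-sticky labels, 1 or 2 otherwise
level = toy["Education"].map(sticky).fillna(0).astype(int)

# Highest sticky level reached so far within each ID
best = level.groupby(toy["ID"]).cummax()

# Replace the label where a sticky level has been reached, else keep the original
toy["Education"] = best.map(labels).fillna(toy["Education"])
print(toy)
```

Unlike the four-level orderings in the other answers, this keeps Other and No qualifications exactly as recorded until one of the two sticky levels appears.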