Pandas DataFrame: change values in a column based on a condition

Time: 2020-08-11 14:48:02

Tags: python pandas panel-data

I have a large DataFrame, shown below:

The data used as an example here (education_val.csv) can be found at https://github.com/ENLK/Py-Projects-/blob/master/education_val.csv

import pandas as pd 

edu = pd.read_csv('education_val.csv')
del edu['Unnamed: 0']
edu.head(10)

ID  Year    Education
22445   1991    higher education
29925   1991    No qualifications
76165   1991    No qualifications
223725  1991    Other
280165  1991    intermediate qualifications
333205  1991    No qualifications
387605  1991    higher education
541285  1991    No qualifications
541965  1991    No qualifications
599765  1991    No qualifications

The values in Education are:

edu.Education.value_counts()

intermediate qualifications 153705
higher education    67020
No qualifications   55842
Other   36915

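I would like to replace the values in the Education column as follows:

  1. If an ID has the value higher education in the Education column in some year, then all future years of that ID should also have higher education in the Education column.

  2. If an ID has the value intermediate qualifications in some year, then all future years of that ID should have intermediate qualifications in the Education column. However, if the value higher education occurs in any subsequent year for this ID, then higher education replaces intermediate qualifications in the following years, regardless of whether Other or No qualifications occur.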

For example, in the DataFrame below, ID 22445 has the value higher education in the year 1991. All subsequent values of Education for 22445 should be replaced with higher education in the later years, up to the year 2017.

edu.loc[edu['ID'] == 22445]

ID  Year    Education
22445   1991    higher education
22445   1992    higher education
22445   1993    higher education
22445   1994    higher education
22445   1995    higher education
22445   1996    intermediate qualifications
22445   1997    intermediate qualifications
22445   1998    Other
22445   1999    No qualifications
22445   2000    intermediate qualifications
22445   2001    intermediate qualifications
22445   2002    intermediate qualifications
22445   2003    intermediate qualifications
22445   2004    intermediate qualifications
22445   2005    intermediate qualifications
22445   2006    intermediate qualifications
22445   2007    intermediate qualifications
22445   2008    intermediate qualifications
22445   2010    intermediate qualifications
22445   2011    intermediate qualifications
22445   2012    intermediate qualifications
22445   2013    intermediate qualifications
22445   2014    intermediate qualifications
22445   2015    intermediate qualifications
22445   2016    intermediate qualifications
22445   2017    intermediate qualifications

Similarly, ID 1587125 in the DataFrame below has the value intermediate qualifications in the year 1991, which changes to higher education in 1993. All subsequent values in the Education column for 1587125 in future years (from 1993 onwards) should be higher education.

edu.loc[edu['ID'] == 1587125]

ID  Year    Education
1587125 1991    intermediate qualifications
1587125 1992    intermediate qualifications
1587125 1993    higher education
1587125 1994    higher education
1587125 1995    higher education
1587125 1996    higher education
1587125 1997    higher education
1587125 1998    higher education
1587125 1999    higher education
1587125 2000    higher education
1587125 2001    higher education
1587125 2002    higher education
1587125 2003    higher education
1587125 2004    Other
1587125 2005    No qualifications
1587125 2006    intermediate qualifications
1587125 2007    intermediate qualifications
1587125 2008    intermediate qualifications
1587125 2010    intermediate qualifications
1587125 2011    higher education
1587125 2012    higher education
1587125 2013    higher education
1587125 2014    higher education
1587125 2015    higher education
1587125 2016    higher education
1587125 2017    higher education

There are 12,057 unique IDs in the data, and the Year column ranges from 1991 to 2017. How can I change the values for all 12,057 IDs according to the conditions above? I am not sure how to do this in a uniform way for all unique IDs. The sample data used as an example here is available at the GitHub link above. Many thanks in advance.

3 Answers:

Answer 0 (score: 2)

You can do this using categorical data:

df = pd.read_csv('https://raw.githubusercontent.com/ENLK/Py-Projects-/master/education_val.csv')

eddtype = pd.CategoricalDtype(['No qualifications', 
                               'Other',
                               'intermediate qualifications',
                               'higher education'], 
                               ordered=True)
df['EducationCat'] = df['Education'].str.strip().astype(eddtype)

df['EduMax'] = df.sort_values('Year').groupby('ID')['EducationCat']\
                 .transform(lambda x: eddtype.categories[x.cat.codes.cummax()] )

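It is broken down explicitly so you can see the data manipulations I am using.

  1. Create an ordered categorical dtype for education
  2. Next, change the dtype of the Education column to that categorical dtype (EducationCat)
  3. Use the categorical codes to perform a cummax calculation
  4. Index into the categories to return the category given by the cummax calculation (EduMax)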
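
The four steps can also be tried on a small in-memory frame. The following is a minimal sketch, not the answer's exact code: the toy rows and the names `toy`, `levels` and `edu_dtype` are made up for illustration, and `groupby(...).cummax()` plus `pd.Categorical.from_codes` are used in place of the `transform` with a lambda. Note that under this ordering Other also carries forward over a later No qualifications, which the question does not explicitly ask for.

```python
import pandas as pd

# Toy data (hypothetical, not taken from education_val.csv)
toy = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2],
    "Year": [1991, 1992, 1993, 1994, 1991, 1992, 1993],
    "Education": [
        "intermediate qualifications", "higher education", "Other", "No qualifications",
        "Other", "No qualifications", "intermediate qualifications",
    ],
})

# Step 1: ordered categorical dtype, lowest level first
levels = ["No qualifications", "Other", "intermediate qualifications", "higher education"]
edu_dtype = pd.CategoricalDtype(levels, ordered=True)

# Step 2: cast the column; labels not listed in `levels` would become NaN (code -1)
toy["EducationCat"] = toy["Education"].str.strip().astype(edu_dtype)

# Step 3: running maximum of the integer codes within each ID, in year order
toy = toy.sort_values(["ID", "Year"])
codes = toy["EducationCat"].cat.codes
max_codes = codes.groupby(toy["ID"]).cummax()

# Step 4: turn the running-maximum codes back into labels
toy["EduMax"] = pd.Categorical.from_codes(max_codes.to_numpy(), dtype=edu_dtype)

print(toy[["ID", "Year", "Education", "EduMax"]])
```

If the real data contains labels outside the four listed levels, their code is -1, so they never raise the running maximum; whether that is the desired behaviour is an assumption to check against the data.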

Output:

df[df['ID'] == 1587125]

            ID  Year                    Education                 EducationCat                       EduMax
18      1587125  1991  intermediate qualifications  intermediate qualifications  intermediate qualifications
12075   1587125  1992  intermediate qualifications  intermediate qualifications  intermediate qualifications
24132   1587125  1993             higher education             higher education             higher education
36189   1587125  1994             higher education             higher education             higher education
48246   1587125  1995             higher education             higher education             higher education
60303   1587125  1996             higher education             higher education             higher education
72360   1587125  1997             higher education             higher education             higher education
84417   1587125  1998             higher education             higher education             higher education
96474   1587125  1999             higher education             higher education             higher education
108531  1587125  2000             higher education             higher education             higher education
120588  1587125  2001             higher education             higher education             higher education
132645  1587125  2002             higher education             higher education             higher education
144702  1587125  2003             higher education             higher education             higher education
156759  1587125  2004                        Other                        Other             higher education
168816  1587125  2005            No qualifications            No qualifications             higher education
180873  1587125  2006  intermediate qualifications  intermediate qualifications             higher education
192930  1587125  2007  intermediate qualifications  intermediate qualifications             higher education
204987  1587125  2008  intermediate qualifications  intermediate qualifications             higher education
217044  1587125  2010  intermediate qualifications  intermediate qualifications             higher education
229101  1587125  2011             higher education             higher education             higher education
241158  1587125  2012             higher education             higher education             higher education
253215  1587125  2013             higher education             higher education             higher education
265272  1587125  2014             higher education             higher education             higher education
277329  1587125  2015             higher education             higher education             higher education
289386  1587125  2016             higher education             higher education             higher education
301443  1587125  2017             higher education             higher education             higher education

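Answer 1 (score: 2)

There is a clear order to the education levels. Your problem can be restated as a "rolling max" problem: what is the highest level of education a person has reached up to a given year?

Try this: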

# A dictionary mapping each label to a rank
mappings = {e: i for i, e in enumerate(['No qualifications', 'Other', 'intermediate qualifications', 'higher education'])}

# Convert the label to its rank
edu['Education'] = edu['Education'].map(mappings)

# The gist of the solution: an expanding max level of education per person
tmp = edu.sort_values('Year').groupby('ID')['Education'].expanding().max()

# The first index level in tmp is the ID, the second level is the original index
# We only need the original index, hence the droplevel
# We also convert the rank back to the label (swapping keys and values in the mappings dictionary)
tmp = tmp.droplevel(0).map({v: k for k, v in mappings.items()})

edu['Education'] = tmp

Test:

edu[edu['ID'] == 1587125]

    ID  Year                    Education
1587125  1991  intermediate qualifications
1587125  1992  intermediate qualifications
1587125  1993             higher education
1587125  1994             higher education
1587125  1995             higher education
1587125  1996             higher education
1587125  1997             higher education
1587125  1998             higher education
1587125  1999             higher education
1587125  2000             higher education
1587125  2001             higher education
1587125  2002             higher education
1587125  2003             higher education
1587125  2004             higher education
1587125  2005             higher education
1587125  2006             higher education
1587125  2007             higher education
1587125  2008             higher education
1587125  2010             higher education
1587125  2011             higher education
1587125  2012             higher education
1587125  2013             higher education
1587125  2014             higher education
1587125  2015             higher education
1587125  2016             higher education
1587125  2017             higher education
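
To see why the `droplevel(0)` step is needed, it may help to look at the index that `expanding().max()` produces after a `groupby`. The sketch below uses a made-up frame (`toy`, with a hypothetical numeric `Rank` column and a deliberately non-default index) and also shows `groupby(...).cummax()`, which should give the same running maximum while keeping the original index.

```python
import pandas as pd

# Toy data (hypothetical); the index is deliberately not 0..n-1
toy = pd.DataFrame(
    {
        "ID": [7, 7, 7, 9, 9],
        "Year": [1991, 1992, 1993, 1991, 1992],
        "Rank": [2, 3, 1, 0, 2],
    },
    index=[10, 11, 12, 13, 14],
)

expanding_max = toy.sort_values("Year").groupby("ID")["Rank"].expanding().max()
# The result carries a two-level index: (ID, original row label)
print(expanding_max)

# Dropping the ID level leaves the original row labels, so assignment aligns by label
toy["RankMax"] = expanding_max.droplevel(0)

# groupby(...).cummax() computes the same running maximum and keeps the original index
toy["RankCummax"] = toy.sort_values("Year").groupby("ID")["Rank"].cummax()

print(toy)
```

Because both assignments align on the row labels, the order in which the sorted intermediate result comes back does not matter.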

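Answer 2 (score: 1)

You can loop over the IDs and then over the years. The DataFrame is in chronological order, so if a person has "higher education" or "intermediate qualifications" in one cell, you can save that knowledge and apply it to the subsequent cells: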

# sort so that each ID's rows are contiguous and in chronological order
edu = edu.sort_values(['ID', 'Year'])
new_values = []

for _, rows in edu.groupby('ID'):
    # booleans to keep track of education statuses we've seen
    higher_ed = False
    inter_qual = False

    for original in rows['Education']:
        value = original

        # check for intermediate qualifications
        if original == 'intermediate qualifications':
            inter_qual = True
        if inter_qual:
            value = 'intermediate qualifications'

        # check for higher education (done last so that it takes precedence)
        if original == 'higher education':
            higher_ed = True
        if higher_ed:
            value = 'higher education'

        new_values.append(value)

# groups are visited in ID order, which matches the sorted row order
edu['Education'] = new_values

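We may overwrite a status more than once; if a person has both "intermediate qualifications" and "higher education", we just need to make sure "higher education" is applied last.

I generally would not recommend using for loops to process a DataFrame, but here each cell value may depend on the values above it, and the DataFrame is not so large that a loop becomes impractical.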
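
For comparison, the same carry-forward logic can probably be written without an explicit loop. This is a sketch under assumptions, not a tested replacement for the loop: the toy rows and the names `sticky_rank`, `rank_to_label` and `best_so_far` are hypothetical. Only the two levels that should persist get a positive rank, so Other and No qualifications are left untouched until one of those levels has been seen.

```python
import pandas as pd

# Toy data (hypothetical, not taken from education_val.csv)
toy = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 2, 2, 2],
    "Year": [1991, 1992, 1993, 1994, 1995, 1991, 1992, 1993],
    "Education": [
        "Other", "No qualifications", "intermediate qualifications", "Other", "higher education",
        "No qualifications", "higher education", "intermediate qualifications",
    ],
})

# Only the two levels that should persist get a positive rank
sticky_rank = {"intermediate qualifications": 1, "higher education": 2}
rank_to_label = {rank: label for label, rank in sticky_rank.items()}

toy = toy.sort_values(["ID", "Year"])

# Every other label maps to NaN and is treated as rank 0
rank = toy["Education"].map(sticky_rank).fillna(0).astype(int)

# Highest sticky rank seen so far within each ID
best_so_far = rank.groupby(toy["ID"]).cummax()

# Where a sticky level has been seen, use it; otherwise keep the original label
toy["Education"] = best_so_far.map(rank_to_label).fillna(toy["Education"])

print(toy)
```

Unlike a full ordering of all four levels, this variant never lets Other overwrite a later No qualifications, which is closer to the two rules stated in the question.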