Question

I have a DataFrame, df, that looks like:

ID    |          TERM       |   DISC_1
1     |         2003-10     |   ECON
1     |         2002-01     |   ECON
1     |         2002-10     |   ECON
2     |         2003-10     |   CHEM
2     |         2004-01     |   CHEM 
2     |         2004-10     |   ENGN
2     |         2005-01     |   ENGN
3     |         2001-01     |   HISTR
3     |         2002-10     |   HISTR 
3     |         2002-10     |   HISTR

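ID is the student ID, TERM is the academic term, and DISC_1 is the discipline of their major. For each student, I want to identify the TERM in which they changed DISC_1 (if they did), and then create a new DataFrame reporting when. A zero means they did not change. The output looks like: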

ID    |     Change
1     |         0     
2     |         2004-01    
3     |         0

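My code works, but it is very slow. I tried to do this with groupby but could not get it to work. Can someone explain how I can accomplish this task more efficiently?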

# PIDM is the student ID (the ID column in the sample above)
df = df.sort_values(by=['PIDM', 'TERM'])
last_PIDM = None
last_DISC_1 = None
change = []
for index, row in df.iterrows():
    # Record TERM when the same student shows a different discipline than on the previous row
    if row['PIDM'] == last_PIDM and row['DISC_1'] != last_DISC_1:
        change.append(row['TERM'])
    else:
        change.append(0)
    # Remember this row for the next comparison (including the very first row)
    last_PIDM = row['PIDM']
    last_DISC_1 = row['DISC_1']

df['change'] = change
change_terms = df.groupby('PIDM')['change'].max()

Answer 1

Here is a start:

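df = df.sort_values(['ID', 'TERM'])
# Previous row's discipline within each student (NaN on each student's first row)
prev = df.groupby('ID').DISC_1.shift()
# Keep TERM only where the discipline differs from the previous row; other rows become NaN
df['Change'] = df.TERM[prev.notna() & (df.DISC_1 != prev)]
df.Change = df.Change.fillna(0)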
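
To get from the per-row flags to the one-row-per-student report the question asks for, something along these lines might work. It is only a sketch under assumptions: `change_report` and its `use_last_old_term` flag are names made up for illustration, TERM is assumed to be a 'YYYY-MM' string so that sorting it as text is chronological, and only the first change per student is kept. The flag is there because the expected output in the question shows 2004-01 for student 2, which is the last CHEM term, while the first ENGN term is 2004-10.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'ID':     [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    'TERM':   ['2003-10', '2002-01', '2002-10', '2003-10', '2004-01',
               '2004-10', '2005-01', '2001-01', '2002-10', '2002-10'],
    'DISC_1': ['ECON', 'ECON', 'ECON', 'CHEM', 'CHEM',
               'ENGN', 'ENGN', 'HISTR', 'HISTR', 'HISTR'],
})


def change_report(df, use_last_old_term=False):
    """Hypothetical helper: one row per ID with the term of the first
    discipline change, or 0 if the student never changed."""
    df = df.sort_values(['ID', 'TERM'])
    grouped = df.groupby('ID')
    prev_disc = grouped['DISC_1'].shift()   # NaN on each student's first row
    prev_term = grouped['TERM'].shift()
    # True where the same student has a different discipline than on the previous row
    changed = prev_disc.notna() & (df['DISC_1'] != prev_disc)
    # Report either the first term in the new discipline or the last term in the old one
    term = prev_term if use_last_old_term else df['TERM']
    first_change = (pd.DataFrame({'ID': df['ID'], 'Change': term})[changed]
                    .drop_duplicates('ID')          # keep the first change per student
                    .set_index('ID')['Change'])
    # Students without any change get 0
    return (first_change.reindex(df['ID'].unique(), fill_value=0)
            .rename_axis('ID').reset_index())


report = change_report(df)
print(report)
```

With the sample data, the default call reports 2004-10 for student 2, and `change_report(df, use_last_old_term=True)` reports 2004-01 instead; students 1 and 3 get 0 either way.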

Answer 2

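I have never been much of a pandas user, so my solution would involve dumping df out as a csv and iterating over each row while keeping hold of the previous row. If it is sorted correctly (first by ID, then by term date), I might write something like this...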

import csv

# Python 3: open csv files in text mode with newline=''
with open('inputDF.csv', newline='') as infile:
    with open('outputDF.csv', 'w', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)

        previousline = next(reader)  # grab the first data row to compare to the second (skip any header row before this)
        termChange = 0
        for line in reader:
            if line[0] != previousline[0]:  # new ID means print and move on to next person
                writer.writerow([previousline[0], termChange])  # print to file ID, termChange date
                termChange = 0
            elif line[2] != previousline[2]:  # new discipline
                termChange = line[1]  # set term changed date
                # termChange = previousline[1]  # in case you would rather retain the last date they were in the old discipline

            previousline = line  # store current line as previous and continue loop

        writer.writerow([previousline[0], termChange])  # write the last student, who has no following ID to trigger the write
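
The same previous-row idea can be tried in memory, without any files, which makes it easier to check against the sample data. This is a rough sketch, not a drop-in replacement: `first_changes` is a hypothetical helper, it assumes each row is an (ID, TERM, DISC_1) tuple, and unlike the loop above it keeps the first change per student rather than the last.

```python
def first_changes(rows):
    """Hypothetical helper. `rows` are (id, term, discipline) tuples sorted
    by id, then term. Returns a list of [id, term_of_first_change_or_0]."""
    result = []
    previous = None
    for row in rows:
        if previous is None or row[0] != previous[0]:
            # New student: start with "no change"
            result.append([row[0], 0])
        elif row[2] != previous[2] and result[-1][1] == 0:
            # Same student, new discipline, and no change recorded yet
            result[-1][1] = row[1]
        previous = row  # keep the current row for the next comparison
    return result


# Sample data from the question; sorting tuples orders by id, then term
rows = sorted([
    (1, '2003-10', 'ECON'), (1, '2002-01', 'ECON'), (1, '2002-10', 'ECON'),
    (2, '2003-10', 'CHEM'), (2, '2004-01', 'CHEM'),
    (2, '2004-10', 'ENGN'), (2, '2005-01', 'ENGN'),
    (3, '2001-01', 'HISTR'), (3, '2002-10', 'HISTR'), (3, '2002-10', 'HISTR'),
])
print(first_changes(rows))
```

Because each student's entry is appended as soon as their first row is seen, the last student needs no special handling after the loop.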

Fast way to compare a value in one pandas row with a value from another row in the previous row?
