I have a DataFrame, df, that looks like:
ID | TERM | DISC_1
1 | 2003-10 | ECON
1 | 2002-01 | ECON
1 | 2002-10 | ECON
2 | 2003-10 | CHEM
2 | 2004-01 | CHEM
2 | 2004-10 | ENGN
2 | 2005-01 | ENGN
3 | 2001-01 | HISTR
3 | 2002-10 | HISTR
3 | 2002-10 | HISTR
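ID is the student ID, TERM is the academic term, and DISC_1 is the discipline of their major. For each student, I want to identify the TERM in which they changed DISC_1 (if they did), and then create a new DataFrame that reports when. A zero means they did not change. The output looks like: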
ID | Change
1 | 0
2 | 2004-01
3 | 0
My code works, but it is very slow. I tried to do this with groupby but couldn't get it to work. Can someone explain how I can do this more efficiently?
df = df.sort_values(by=['ID', 'TERM'])
last_ID = None
last_DISC_1 = None
change = []
for index, row in df.iterrows():
    # Record the term only when the same student shows a different discipline
    if row['ID'] == last_ID and row['DISC_1'] != last_DISC_1:
        change.append(row['TERM'])
    else:
        change.append('0')  # string '0' so max() can compare it with the TERM strings
    # Remember this row for the comparison on the next iteration
    last_ID = row['ID']
    last_DISC_1 = row['DISC_1']
df['change'] = change
change_terms = df.groupby('ID')['change'].max()
Answer 0 (score: 4)
Here's a start:
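df = df.sort_values(['ID', 'TERM'])
# group_keys=False keeps the original row index so the mask aligns with df
gb = df.groupby('ID', group_keys=False).DISC_1
# Keep TERM only where the discipline differs from the student's previous row
df['Change'] = df.TERM[gb.apply(lambda x: x != x.shift().bfill())]
df['Change'] = df['Change'].fillna(0)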
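To get from there to the one-row-per-student report the question asks for, something along these lines might work. It is only a sketch: it rebuilds the sample data from the question, the names prev_disc, changed, first_change and report are made up for illustration, and it assumes you want the first term in the new discipline (which is what the loop in the question records), keeping only the earliest change if a student switches more than once.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    'TERM': ['2003-10', '2002-01', '2002-10', '2003-10', '2004-01',
             '2004-10', '2005-01', '2001-01', '2002-10', '2002-10'],
    'DISC_1': ['ECON', 'ECON', 'ECON', 'CHEM', 'CHEM',
               'ENGN', 'ENGN', 'HISTR', 'HISTR', 'HISTR'],
})

df = df.sort_values(['ID', 'TERM'])

# Discipline on the student's previous row; NaN on each student's first row
prev_disc = df.groupby('ID')['DISC_1'].shift()

# A change is a row that has a previous row and a different discipline
changed = prev_disc.notna() & (df['DISC_1'] != prev_disc)

# Earliest changed TERM per student; students with no change get 0
first_change = df.loc[changed].groupby('ID')['TERM'].first()
report = (
    first_change.reindex(df['ID'].unique(), fill_value=0)
    .rename('Change')
    .rename_axis('ID')
    .reset_index()
)
print(report)
```

On the sample data this reports 2004-10 for student 2, the first ENGN term, rather than the 2004-01 shown in the question's expected output. If the last term in the old discipline is what you're after, the same groupby shift applied to TERM should give you that instead.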
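Answer 1 (score: 2)
I've never been much of a pandas user, so my solution would involve dumping df to a CSV and iterating over each row while keeping track of the previous row. Provided it's sorted correctly (first by ID, then by term date), I might write something like this...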
import csv

# Assumes inputDF.csv has no header row and the columns are ID, TERM, DISC_1
with open('inputDF.csv', newline='') as infile:
    with open('outputDF.csv', 'w', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        previousline = next(reader)  # grab the first row to compare to the second
        termChange = 0
        for line in reader:
            if line[0] != previousline[0]:  # new ID means write and move on to next person
                writer.writerow([previousline[0], termChange])  # write ID, termChange date
                termChange = 0
            elif line[2] != previousline[2]:  # new discipline
                termChange = line[1]  # set term changed date
                # termChange = previousline[1]  # use this instead to keep the last term in the old discipline
            previousline = line  # store current line as previous and continue loop
        writer.writerow([previousline[0], termChange])  # write the last student, who has no following row
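For completeness, here is one way the input file could be produced from the DataFrame. This is a rough sketch, not part of the script above: it uses a cut-down version of the question's data, and it assumes the reader wants the rows sorted by ID and then TERM, with no header and no index column, written to the same inputDF.csv name.

```python
import pandas as pd

# A cut-down version of the question's data, for illustration only
df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 2],
    'TERM': ['2003-10', '2002-01', '2004-10', '2003-10', '2004-01'],
    'DISC_1': ['ECON', 'ECON', 'ENGN', 'CHEM', 'CHEM'],
})

# Sort by student and term, then write only the three columns the
# csv loop indexes by position (0 = ID, 1 = TERM, 2 = DISC_1)
(
    df.sort_values(['ID', 'TERM'])
      .to_csv('inputDF.csv', columns=['ID', 'TERM', 'DISC_1'],
              header=False, index=False)
)
```

Leaving the header out matters here, because the loop treats the very first line of the file as a data row.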