I am trying to build a pivot table (which will later become the transition matrix of a Markov chain). Below is some fake data that more or less represents the real data. I have 20+ years of data, with at least 200 million rows per year.
import numpy as np
import pandas as pd

newd = {'year': [2001, 2002, 2005, 2002, 2004, 1999, 1999, 1999, 2012, 2000, 2010, 2005, 2006, 2009, 2009,
                 2009, 2009, 2010, 2007, 2008, 2009, 2010, 2000, 2001, 2002],
        'tin': [12, 23, 24, 28, 30, 12, 7, 12, 12, 23, 24, 7, 12, 35, 39, 37, 36, 333, 13, 13, 13, 13, 7, 7, 7],
        'ptin': [12, 23, 28, 22, 12, 12, 0, 12, 12, 23, 27, 45, 99, 7, 7, 7, 7, 0, 17, 21, 26, 18, 0, 18, 19]}
newdf = pd.DataFrame(newd)

# Count, for each (tin, year) pair, how often each ptin value occurs
print(newdf.pivot_table(index=['tin', 'year'], columns='ptin', values='ptin',
                        aggfunc=len, fill_value=0))
This gives:
ptin 0 7 12 17 18 19 21 22 23 26 27 28 45 99
tin year
7 1999 1 0 0 0 0 0 0 0 0 0 0 0 0 0
2000 1 0 0 0 0 0 0 0 0 0 0 0 0 0
2001 0 0 0 0 1 0 0 0 0 0 0 0 0 0
2002 0 0 0 0 0 1 0 0 0 0 0 0 0 0
2005 0 0 0 0 0 0 0 0 0 0 0 0 1 0
12 1999 0 0 2 0 0 0 0 0 0 0 0 0 0 0
2001 0 0 1 0 0 0 0 0 0 0 0 0 0 0
2006 0 0 0 0 0 0 0 0 0 0 0 0 0 1
2012 0 0 1 0 0 0 0 0 0 0 0 0 0 0
13 2007 0 0 0 1 0 0 0 0 0 0 0 0 0 0
2008 0 0 0 0 0 0 1 0 0 0 0 0 0 0
2009 0 0 0 0 0 0 0 0 0 1 0 0 0 0
2010 0 0 0 0 1 0 0 0 0 0 0 0 0 0
23 2000 0 0 0 0 0 0 0 0 1 0 0 0 0 0
2002 0 0 0 0 0 0 0 0 1 0 0 0 0 0
24 2005 0 0 0 0 0 0 0 0 0 0 0 1 0 0
2010 0 0 0 0 0 0 0 0 0 0 1 0 0 0
28 2002 0 0 0 0 0 0 0 1 0 0 0 0 0 0
30 2004 0 0 1 0 0 0 0 0 0 0 0 0 0 0
35 2009 0 1 0 0 0 0 0 0 0 0 0 0 0 0
36 2009 0 1 0 0 0 0 0 0 0 0 0 0 0 0
37 2009 0 1 0 0 0 0 0 0 0 0 0 0 0 0
39 2009 0 1 0 0 0 0 0 0 0 0 0 0 0 0
333 2010 1 0 0 0 0 0 0 0 0 0 0 0 0 0
which is consistent with the dataframe above.
Now the question: is this code viable if I need to run it on billions of rows across these three columns? For example, in coming years the pivot would have about 6.5 million columns and always around 200 million rows per year. Has anyone run into this? (One alternative I keep coming back to is sketched below.)
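For context, the alternative I have been sketching (only a sketch on the toy newdf above, not tested at anything like the real scale) is to never materialize the dense pivot at all: keep the counts in long format, or hold them in a scipy.sparse matrix.

# Sketch only, using the toy newdf defined above: the same counts without a dense
# (tin, year) x ptin matrix.
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Long format: one row per (tin, year, ptin) combination that actually occurs
counts = (newdf.groupby(['tin', 'year', 'ptin'])
               .size()
               .rename('n')
               .reset_index())

# Sparse equivalent: rows are (tin, year) groups, columns are distinct ptin values
row_codes = newdf.groupby(['tin', 'year']).ngroup().to_numpy()
col_codes, col_labels = pd.factorize(newdf['ptin'])
mat = coo_matrix((np.ones(len(newdf)), (row_codes, col_codes)),
                 shape=(row_codes.max() + 1, len(col_labels))).tocsr()  # duplicates sum into counts

Both variants keep memory proportional to the number of non-zero cells rather than 6.5 million columns times the number of (tin, year) rows, which is why I am wondering whether the dense pivot_table approach is the right tool here.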
I am also trying to estimate the probability that a tin transitions from ptin y to ptin z in year x. Any ideas? I know this is asking a lot, but I have gotten a lot of good answers on Stack Overflow before. Another strategy I am considering is binning the data, but I don't have a concrete plan for that yet. A rough sketch of what I mean by a transition matrix is below.
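To make the transition part concrete, here is a rough sketch on the toy newdf above. The pairing rule (a tin's ptin in one observation followed by its ptin in the next observation, ordered by year) and the row normalization are my own assumptions about what the transition matrix should look like; the real definition may differ.

# Sketch only: ptin -> next-ptin transition probabilities, pairing consecutive
# observations of the same tin in year order.
steps = newdf.sort_values(['tin', 'year']).copy()
steps['next_ptin'] = steps.groupby('tin')['ptin'].shift(-1)  # next observation's ptin
steps = steps.dropna(subset=['next_ptin'])                   # last observation per tin has no successor

# Count each (ptin, next_ptin) pair, then normalize rows so each sums to 1
trans_counts = steps.pivot_table(index='ptin', columns='next_ptin',
                                 values='year', aggfunc=len, fill_value=0)
trans_prob = trans_counts.div(trans_counts.sum(axis=1), axis=0)
print(trans_prob)

With the real data I assume the same counting could be done on the long-format counts from the previous sketch instead of a dense pivot, but I would welcome other approaches.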