我有一个叫做&p 39;调整的pandas完整数据框。我想添加专栏' stage'新的价值观条件在' fyear'并且' conm'。
fyear conm indadjsg
1 1999 1-800-FLOWERS.COM 26.646086
2 2000 1-800-FLOWERS.COM 22.727175
3 2001 1-800-FLOWERS.COM 7.312014
4 2002 1-800-FLOWERS.COM 4.948308
5 2003 1-800-FLOWERS.COM 6.278798
23 1996 ABERCROMBIE & FITCH -CL A 34.831691
24 1997 ABERCROMBIE & FITCH -CL A 48.053137
25 1998 ABERCROMBIE & FITCH -CL A 48.918326
26 1999 ABERCROMBIE & FITCH -CL A 46.956456
27 2000 ABERCROMBIE & FITCH -CL A 33.91436
28 2001 ABERCROMBIE & FITCH -CL A 67.23423
29 2002 ABERCROMBIE & FITCH -CL A 99.09342
11929 2006 CLIFTON BANCORP INC 0.236418
11930 2007 CLIFTON BANCORP INC -1.366626
11931 2008 CLIFTON BANCORP INC 8.564019
11932 2009 CLIFTON BANCORP INC -4.966110
11933 2010 CLIFTON BANCORP INC -4.359552
11934 2011 CLIFTON BANCORP INC -16.313852
11935 2012 CLIFTON BANCORP INC -18.193550
11936 2013 CLIFTON BANCORP INC -10.126603
11937 2014 CLIFTON BANCORP INC 4.718584
11938 2015 CLIFTON BANCORP INC -11.889065
11940 2015 CLIPPER REALTY INC 70.945767
11941 2016 CLIPPER REALTY INC 3.776001
11980 2014 CM FINANCE INC 205.894048
11981 2015 CM FINANCE INC 68.518555
121247 2009 VCA INC -5.552030
121248 2010 VCA INC -3.357275
121249 2011 VCA INC -0.930798
121250 2012 VCA INC 5.974914
121256 2007 VIASPACE INC -50.966869
121257 2008 VIASPACE INC 149.957403
121258 2009 VIASPACE INC 197.776855
121259 2010 VIASPACE INC -25.201733
121260 2011 VIASPACE INC 77.082624
121261 2012 VIASPACE INC 78.034233
121266 2005 YASHENG GROUP -3.728098
121267 2006 YASHENG GROUP -2.233927
121268 2007 YASHENG GROUP 0.349349
121279 2009 YUHE INTERNATIONAL INC 27.995324
121280 2010 YUHE INTERNATIONAL INC 34.375630
1)我想填写“开始”#39;如果独特公司的化学年数等于或小于5.
byyr = adjusted.groupby(by=['conm'])['fyear']
dfbyyr =byyr.count().to_frame()
start = dfbyyr[dfbyyr['fyear'] <= 5]
fyear
conm
1-800-FLOWERS.COM 5
ABERCROMBIE & FITCH -CL A 7
CLIFTON BANCORP INC 10
CLIPPER REALTY INC 2
CM FINANCE INC 2
VCA INC 4
VIASPACE INC 6
YASHENG GROUP 3
YUHE INTERNATIONAL INC 2
2)在我填写&#39;开始&#39;条件,剩下的数据,我想填补另一个值。 我计算了独特公司的平均值indadjsg。
mask2 = adjusted.groupby(by=['conm'])['indadjsg']
countsg = mask2.mean().to_frame().reset_index()
c = countsg.dropna()
数据框&#39; c&#39;
conm indadjsg
0 1-800-FLOWERS.COM 3.291539
1 ABERCROMBIE & FITCH -CL A 105.335324
2 CLIFTON BANCORP INC 22.920683
3 CLIPPER REALTY INC 36.784677
4 CM FINANCE INC 1.605919
5 VCA INC 3.116871
6 VIASPACE INC -106.153789
7 YASHENG GROUP -2.676296
8 YUHE INTERNATIONAL INC 12.306557
我想给出的条件如下:
indadjsg < 0, 'decline'
0 <= indadjsg <= 15, 'revival'
15< indadjsg <= 100, 'mature'
100< indajsg , 'growth'
我想制作的最终数据框架是这样的
fyear conm indadjsg stage
1 1999 1-800-FLOWERS.COM 26.646086 start
2 2000 1-800-FLOWERS.COM 22.727175 start
3 2001 1-800-FLOWERS.COM 7.312014 start
4 2002 1-800-FLOWERS.COM 4.948308 start
5 2003 1-800-FLOWERS.COM 6.278798 start
23 1996 ABERCROMBIE & FITCH -CL A 34.831691 growth
24 1997 ABERCROMBIE & FITCH -CL A 48.053137 growth
25 1998 ABERCROMBIE & FITCH -CL A 48.918326 growth
26 1999 ABERCROMBIE & FITCH -CL A 46.956456 growth
27 2000 ABERCROMBIE & FITCH -CL A 33.91436 growth
28 2001 ABERCROMBIE & FITCH -CL A 67.23423 growth
29 2002 ABERCROMBIE & FITCH -CL A 99.09342 growth
11929 2006 CLIFTON BANCORP INC 0.236418 mature
11930 2007 CLIFTON BANCORP INC -1.366626 mature
11931 2008 CLIFTON BANCORP INC 8.564019 mature
11932 2009 CLIFTON BANCORP INC -4.966110 mature
11933 2010 CLIFTON BANCORP INC -4.359552 mature
11934 2011 CLIFTON BANCORP INC -16.313852 mature
11935 2012 CLIFTON BANCORP INC -18.193550 mature
11936 2013 CLIFTON BANCORP INC -10.126603 mature
11937 2014 CLIFTON BANCORP INC 4.718584 mature
11938 2015 CLIFTON BANCORP INC -11.889065 mature
11940 2015 CLIPPER REALTY INC 70.945767 start
11941 2016 CLIPPER REALTY INC 3.776001 start
11980 2014 CM FINANCE INC 205.894048 start
11981 2015 CM FINANCE INC 68.518555 start
121247 2009 VCA INC -5.552030 start
121248 2010 VCA INC -3.357275 start
121249 2011 VCA INC -0.930798 start
121250 2012 VCA INC 5.974914 start
121256 2007 VIASPACE INC -50.966869 decline
121257 2008 VIASPACE INC 149.957403 decline
121258 2009 VIASPACE INC 197.776855 decline
121259 2010 VIASPACE INC -25.201733 decline
121260 2011 VIASPACE INC 77.082624 decline
121261 2012 VIASPACE INC 78.034233 decline
121266 2005 YASHENG GROUP -3.728098 start
121267 2006 YASHENG GROUP -2.233927 start
121268 2007 YASHENG GROUP 0.349349 start
121279 2009 YUHE INTERNATIONAL INC 27.995324 start
121280 2010 YUHE INTERNATIONAL INC 34.375630 start
我有什么方法可以立刻做到吗?我只能考虑制作单独的列并合并它。你能帮我有效地思考吗?提前谢谢。
答案 0 :(得分:3)
adjusted # copied text from your example
Out[86]:
fyear conm indadjsg
0 1999 1-800-FLOWERS.COM 26.64609
1 2000 1-800-FLOWERS.COM 22.72717
2 2001 1-800-FLOWERS.COM 7.31201
3 2002 1-800-FLOWERS.COM 4.94831
4 2003 1-800-FLOWERS.COM 6.27880
5 1996 ABERCROMBIE 34.83169
6 1997 ABERCROMBIE 48.05314
7 1998 ABERCROMBIE 48.91833
8 1999 ABERCROMBIE 46.95646
9 2000 ABERCROMBIE 33.91436
10 2001 ABERCROMBIE 67.23423
11 2002 ABERCROMBIE 99.09342
.. ... ... ...
25 2015 CM 68.51856
26 2009 VCA -5.55203
27 2010 VCA -3.35728
28 2011 VCA -0.93080
29 2012 VCA 5.97491
30 2007 VIASPACE -50.96687
31 2008 VIASPACE 149.95740
32 2009 VIASPACE 197.77686
33 2010 VIASPACE -25.20173
34 2011 VIASPACE 77.08262
35 2012 VIASPACE 78.03423
36 2005 YASHENG -3.72810
byyr = adjusted.groupby(by='conm')['fyear'].count().to_frame()
start = byyr.fyear[adjusted.conm]
indadjsg = adjusted.groupby(by='conm')['indadjsg'].mean().to_frame()
px = indadjsg.indadjsg[adjusted.conm]
categories = pd.cut(px.values.reshape((len(px), )),
bins= [-np.inf, 0, 15, 100, np.inf],
labels=['decline', 'revival', 'mature', 'growth'])
adjusted.loc[:, 'stage'] = np.where(start <= 5, 'start', categories)
adjusted # result
Out[130]:
fyear conm indadjsg stage
0 1999 1-800-FLOWERS.COM 26.64609 start
1 2000 1-800-FLOWERS.COM 22.72717 start
2 2001 1-800-FLOWERS.COM 7.31201 start
3 2002 1-800-FLOWERS.COM 4.94831 start
4 2003 1-800-FLOWERS.COM 6.27880 start
5 1996 ABERCROMBIE 34.83169 mature
6 1997 ABERCROMBIE 48.05314 mature
7 1998 ABERCROMBIE 48.91833 mature
8 1999 ABERCROMBIE 46.95646 mature
9 2000 ABERCROMBIE 33.91436 mature
10 2001 ABERCROMBIE 67.23423 mature
11 2002 ABERCROMBIE 99.09342 mature
.. ... ... ... ...
25 2015 CM 68.51856 start
26 2009 VCA -5.55203 start
27 2010 VCA -3.35728 start
28 2011 VCA -0.93080 start
29 2012 VCA 5.97491 start
30 2007 VIASPACE -50.96687 mature
31 2008 VIASPACE 149.95740 mature
32 2009 VIASPACE 197.77686 mature
33 2010 VIASPACE -25.20173 mature
34 2011 VIASPACE 77.08262 mature
35 2012 VIASPACE 78.03423 mature
36 2005 YASHENG -3.72810 start
在pd.cut上,请务必使用right=True
或right=False.
答案 1 :(得分:1)
有一种方法可以使用单个stage
操作计算groupby/transform
列(请参阅下面的classify
函数),但它涉及为每个组调用一次自定义Python函数。如果有很多组,这往往效率低下。
通常,当您替换大量Python时,您会获得更好的性能 函数调用在整个(大)DataFrame上进行矢量化操作 DataFrame的大列。
因此,如果有很多conm
s(即很多群组),那么可能会更好
按照你的第一个想法 - 为每个公司计算阶段然后合并
结果回到adjusted
。这是一种方法 - 合并是
通过致电join
来完成:
import numpy as np
import pandas as pd
adjusted = pd.DataFrame({'conm': ['1-800-FLOWERS.COM', '1-800-FLOWERS.COM', '1-800-FLOWERS.COM', '1-800-FLOWERS.COM', '1-800-FLOWERS.COM', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'ABERCROMBIE & FITCH -CL A', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIFTON BANCORP INC', 'CLIPPER REALTY INC', 'CLIPPER REALTY INC', 'CM FINANCE INC', 'CM FINANCE INC', 'VCA INC', 'VCA INC', 'VCA INC', 'VCA INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'VIASPACE INC', 'YASHENG GROUP', 'YASHENG GROUP', 'YASHENG GROUP', 'YUHE INTERNATIONAL INC', 'YUHE INTERNATIONAL INC'], 'fyear': [1999, 2000, 2001, 2002, 2003, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2015, 2016, 2014, 2015, 2009, 2010, 2011, 2012, 2007, 2008, 2009, 2010, 2011, 2012, 2005, 2006, 2007, 2009, 2010], 'indadjsg': [26.646085999999997, 22.727175, 7.312014, 4.948308, 6.278798, 34.831691, 48.053137, 48.918326, 46.956456, 33.914359999999995, 67.23423000000001, 99.09342, 0.236418, -1.3666260000000001, 8.564019, -4.96611, -4.359552, -16.313852, -18.19355, -10.126603, 4.718584, -11.889064999999999, 70.945767, 3.7760010000000004, 205.894048, 68.518555, -5.55203, -3.357275, -0.9307979999999999, 5.974914, -50.966869, 149.957403, 197.776855, -25.201732999999997, 77.082624, 78.034233, -3.728098, -2.233927, 0.34934899999999997, 27.995324, 34.37563]}, index=[1, 2, 3, 4, 5, 23, 24, 25, 26, 27, 28, 29, 11929, 11930, 11931, 11932, 11933, 11934, 11935, 11936, 11937, 11938, 11940, 11941, 11980, 11981, 121247, 121248, 121249, 121250, 121256, 121257, 121258, 121259, 121260, 121261, 121266, 121267, 121268, 121279, 121280])
grouped = adjusted.groupby(by=['conm'])
stage = pd.cut(grouped['indadjsg'].mean(), bins=[-np.inf,0,15,100,np.inf], labels=False)
stage.name = 'stage'
labels = np.array(['decline', 'revival', 'mature', 'growth'])
adjusted = adjusted.join(stage, on='conm')
adjusted['stage'] = labels[adjusted['stage']]
mask = (grouped['fyear'].transform('count') <= 5)
adjusted.loc[mask, 'stage'] = 'start'
print(adjusted)
产量
conm fyear indadjsg stage
1 1-800-FLOWERS.COM 1999 26.646086 start
2 1-800-FLOWERS.COM 2000 22.727175 start
3 1-800-FLOWERS.COM 2001 7.312014 start
4 1-800-FLOWERS.COM 2002 4.948308 start
5 1-800-FLOWERS.COM 2003 6.278798 start
23 ABERCROMBIE & FITCH -CL A 1996 34.831691 mature
24 ABERCROMBIE & FITCH -CL A 1997 48.053137 mature
25 ABERCROMBIE & FITCH -CL A 1998 48.918326 mature
26 ABERCROMBIE & FITCH -CL A 1999 46.956456 mature
27 ABERCROMBIE & FITCH -CL A 2000 33.914360 mature
28 ABERCROMBIE & FITCH -CL A 2001 67.234230 mature
29 ABERCROMBIE & FITCH -CL A 2002 99.093420 mature
11929 CLIFTON BANCORP INC 2006 0.236418 decline
11930 CLIFTON BANCORP INC 2007 -1.366626 decline
11931 CLIFTON BANCORP INC 2008 8.564019 decline
11932 CLIFTON BANCORP INC 2009 -4.966110 decline
11933 CLIFTON BANCORP INC 2010 -4.359552 decline
11934 CLIFTON BANCORP INC 2011 -16.313852 decline
11935 CLIFTON BANCORP INC 2012 -18.193550 decline
11936 CLIFTON BANCORP INC 2013 -10.126603 decline
11937 CLIFTON BANCORP INC 2014 4.718584 decline
11938 CLIFTON BANCORP INC 2015 -11.889065 decline
11940 CLIPPER REALTY INC 2015 70.945767 start
11941 CLIPPER REALTY INC 2016 3.776001 start
11980 CM FINANCE INC 2014 205.894048 start
11981 CM FINANCE INC 2015 68.518555 start
121247 VCA INC 2009 -5.552030 start
121248 VCA INC 2010 -3.357275 start
121249 VCA INC 2011 -0.930798 start
121250 VCA INC 2012 5.974914 start
121256 VIASPACE INC 2007 -50.966869 mature
121257 VIASPACE INC 2008 149.957403 mature
121258 VIASPACE INC 2009 197.776855 mature
121259 VIASPACE INC 2010 -25.201733 mature
121260 VIASPACE INC 2011 77.082624 mature
121261 VIASPACE INC 2012 78.034233 mature
121266 YASHENG GROUP 2005 -3.728098 start
121267 YASHENG GROUP 2006 -2.233927 start
121268 YASHENG GROUP 2007 0.349349 start
121279 YUHE INTERNATIONAL INC 2009 27.995324 start
121280 YUHE INTERNATIONAL INC 2010 34.375630 start
这是替代方法,当有很多组时速度较慢(但如果组很少,可能会更快)。
您可以使用单个stage
操作计算groupby/transform
列
使用自定义Python函数classify
。 classify
为每个组调用一次 - 即每conm
的值一次。
import bisect
def classify(grp, grid=[0,15,100,np.inf],
labels=['decline', 'revival', 'mature', 'growth']):
return 'start' if len(grp) <= 5 else labels[bisect.bisect_left(grid, grp.mean())]
grouped = adjusted.groupby(by=['conm'])
adjusted['stage'] = grouped['indadjsg'].transform(classify)
print(adjusted)
答案 2 :(得分:0)
从:
开始Adjusted:
fyear conm indadjsg
0 1999 1-800-FLOWERS.COM 26.646086
1 2000 1-800-FLOWERS.COM 22.727175
2 2001 1-800-FLOWERS.COM 7.312014
3 2002 1-800-FLOWERS.COM 4.948308
4 2003 1-800-FLOWERS.COM 6.278798
5 1996 ABERCROMBIE & FITCH -CL A 34.831691
6 1997 ABERCROMBIE & FITCH -CL A 48.053137
...
35 2012 VIASPACE INC 78.034233
36 2005 YASHENG GROUP -3.728098
37 2006 YASHENG GROUP -2.233927
38 2007 YASHENG GROUP 0.349349
39 2009 YUHE INTERNATIONAL INC 27.995324
40 2010 YUHE INTERNATIONAL INC 34.375630
这段代码并不是特别聪明,但它非常简单:
# add an empty "stage" column
adjusted['stage'] = ''
# create boolean masks for each stage classification
g = adjusted.groupby(by='conm')
decline = g['indadjsg'].transform('mean') < 0
revival = (g['indadjsg'].transform('mean') >= 0) & (g['indadjsg'].transform('mean') <= 15)
mature = (g['indadjsg'].transform('mean') > 15) & (g['indadjsg'].transform('mean') <= 100)
growth = (g['indadjsg'].transform('mean') > 100)
start = g['fyear'].transform('count') <= 5
adjusted.loc[decline, 'stage'] = 'decline'
adjusted.loc[revival, 'stage'] = 'revival'
adjusted.loc[mature, 'stage'] = 'mature'
adjusted.loc[growth, 'stage'] = 'growth'
# set 'start' classification last so it overwrites
# the classification set based on
adjusted.loc[start, 'stage'] = 'start'
输出如下:
fyear conm indadjsg stage
0 1999 1-800-FLOWERS.COM 26.646086 start
1 2000 1-800-FLOWERS.COM 22.727175 start
2 2001 1-800-FLOWERS.COM 7.312014 start
3 2002 1-800-FLOWERS.COM 4.948308 start
4 2003 1-800-FLOWERS.COM 6.278798 start
5 1996 ABERCROMBIE & FITCH -CL A 34.831691 mature
6 1997 ABERCROMBIE & FITCH -CL A 48.053137 mature
7 1998 ABERCROMBIE & FITCH -CL A 48.918326 mature
8 1999 ABERCROMBIE & FITCH -CL A 46.956456 mature
9 2000 ABERCROMBIE & FITCH -CL A 33.914360 mature
10 2001 ABERCROMBIE & FITCH -CL A 67.234230 mature
11 2002 ABERCROMBIE & FITCH -CL A 99.093420 mature
12 2006 CLIFTON BANCORP INC 0.236418 decline
13 2007 CLIFTON BANCORP INC -1.366626 decline
14 2008 CLIFTON BANCORP INC 8.564019 decline
15 2009 CLIFTON BANCORP INC -4.966110 decline
16 2010 CLIFTON BANCORP INC -4.359552 decline
17 2011 CLIFTON BANCORP INC -16.313852 decline
18 2012 CLIFTON BANCORP INC -18.193550 decline
19 2013 CLIFTON BANCORP INC -10.126603 decline
20 2014 CLIFTON BANCORP INC 4.718584 decline
21 2015 CLIFTON BANCORP INC -11.889065 decline
22 2015 CLIPPER REALTY INC 70.945767 start
23 2016 CLIPPER REALTY INC 3.776001 start
24 2014 CM FINANCE INC 205.894048 start
25 2015 CM FINANCE INC 68.518555 start
26 2009 VCA INC -5.552030 start
27 2010 VCA INC -3.357275 start
28 2011 VCA INC -0.930798 start
29 2012 VCA INC 5.974914 start
30 2007 VIASPACE INC -50.966869 mature
31 2008 VIASPACE INC 149.957403 mature
32 2009 VIASPACE INC 197.776855 mature
33 2010 VIASPACE INC -25.201733 mature
34 2011 VIASPACE INC 77.082624 mature
35 2012 VIASPACE INC 78.034233 mature
36 2005 YASHENG GROUP -3.728098 start
37 2006 YASHENG GROUP -2.233927 start
38 2007 YASHENG GROUP 0.349349 start
39 2009 YUHE INTERNATIONAL INC 27.995324 start
40 2010 YUHE INTERNATIONAL INC 34.375630 start