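I have a pandas DataFrame with (at the moment) about 4,000 rows, which I have already put in the order I need.
I need to split the DF into 10 arbitrary bins by row count. So I would like to append a bin number to each row for later aggregation by bin: about 400 rows per bin, since the last bin will not be completely filled.
I don't want to do everything in one operation right now; I just need the bin number appended to each row, so that I can do the group-by aggregation in a separate step.
I have tried pd.qcut and pd.cut, but with no luck.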
ccs_category sex race marital_status presence_of_child payer_type no_of_persons occupation education wealth_rate donate_charity zip age 48_prediction count cum_count cum_accurate cum_percent
263 218 M U S N PK 999 99 0 99 U 60657.0 0 0.000538 1 1 0 0.0
2452 250 M W U U NK 0 99 0 99 U 8730.0 29 0.000404 1 2 0 0.0
2814 127 F W U N MK 2 8 2 9 Y 53051.0 75 0.000369 1 3 0 0.0
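So I would like the last column to be bin_nu, with sequential labels 1-10 (0-9 would be fine too), e.g. the first 400 rows get label 1, the second 400 rows get label 2, and so on.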
results3['bin_nu']=pd.cut(results3,10,labels=False)
I also tried retbins=True, but got no result. The call gave me:
TypeError: '<=' not supported between instances of 'str' and 'int'
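I have looked around, but everything seems to bin by value (or by time series) rather than by an arbitrary count of items?
Here are the first ten rows of the DF converted to a dict: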
{'48_prediction': {263: 0.0005377883207984269,
578: 0.00030468139448203146,
1168: 0.0003036215784959495,
2021: 0.0003635243338067085,
2452: 0.0004036910831928253,
2518: 0.00029300281312316656,
2814: 0.00036939769051969046,
3687: 0.0003046905330847949,
3963: 0.00029347193776629865,
7132: 0.00035702803870663047},
'age': {263: 0,
578: 68,
1168: 1,
2021: 42,
2452: 29,
2518: 92,
2814: 75,
3687: 40,
3963: 51,
7132: 34},
'ccs_category': {263: '218',
578: '250',
1168: '166',
2021: '90',
2452: '250',
2518: '3',
2814: '127',
3687: '58',
3963: '160',
7132: '196'},
'count': {263: 1,
578: 1,
1168: 1,
2021: 1,
2452: 1,
2518: 1,
2814: 1,
3687: 1,
3963: 1,
7132: 1},
'cum_accurate': {263: 0,
578: 0,
1168: 0,
2021: 0,
2452: 0,
2518: 0,
2814: 0,
3687: 0,
3963: 0,
7132: 0},
'cum_count': {263: 1,
578: 7,
1168: 8,
2021: 4,
2452: 2,
2518: 10,
2814: 3,
3687: 6,
3963: 9,
7132: 5},
'cum_percent': {263: 0.0,
578: 0.0,
1168: 0.0,
2021: 0.0,
2452: 0.0,
2518: 0.0,
2814: 0.0,
3687: 0.0,
3963: 0.0,
7132: 0.0},
'donate_charity': {263: 'U',
578: 'U',
1168: 'Y',
2021: 'U',
2452: 'U',
2518: 'U',
2814: 'Y',
3687: 'U',
3963: 'U',
7132: 'U'},
'education': {263: 0,
578: 0,
1168: 0,
2021: 1,
2452: 0,
2518: 0,
2814: 2,
3687: 0,
3963: 4,
7132: 0},
'marital_status': {263: 'S',
578: 'U',
1168: 'U',
2021: 'M',
2452: 'U',
2518: 'U',
2814: 'U',
3687: 'U',
3963: 'M',
7132: 'U'},
'no_of_persons': {263: 999,
578: 0,
1168: 2,
2021: 2,
2452: 0,
2518: 0,
2814: 2,
3687: 0,
3963: 3,
7132: 0},
'occupation': {263: '99',
578: '99',
1168: '99',
2021: '0',
2452: '99',
2518: '99',
2814: '8',
3687: '99',
3963: '0',
7132: '99'},
'payer_type': {263: 'PK',
578: 'MI',
1168: 'PK',
2021: 'NK',
2452: 'NK',
2518: 'MK',
2814: 'MK',
3687: 'PK',
3963: 'PK',
7132: 'NK'},
'presence_of_child': {263: 'N',
578: 'U',
1168: 'Y',
2021: 'Y',
2452: 'U',
2518: 'U',
2814: 'N',
3687: 'Y',
3963: 'N',
7132: 'U'},
'race': {263: 'U',
578: 'B',
1168: 'W',
2021: 'W',
2452: 'W',
2518: 'W',
2814: 'W',
3687: 'B',
3963: 'W',
7132: 'A'},
'sex': {263: 'M',
578: 'F',
1168: 'M',
2021: 'M',
2452: 'M',
2518: 'M',
2814: 'F',
3687: 'F',
3963: 'F',
7132: 'F'},
'wealth_rate': {263: '99',
578: '7',
1168: '8',
2021: '6',
2452: '99',
2518: '99',
2814: '9',
3687: '0',
3963: '6',
7132: '8'},
'zip': {263: 60657.0,
578: 76021.0,
1168: 85711.0,
2021: 7747.0,
2452: 8730.0,
2518: 30680.0,
2814: 53051.0,
3687: 7740.0,
3963: 62025.0,
7132: 19082.0}}
Answer 0 (score: 1)
Let's start with a random DataFrame of 50 rows:
df = pd.DataFrame(np.random.randn(50, 4), columns=list("ABCD"))
A B C D
0 0.113454 3.357840 -0.413755 -1.089784
1 0.800012 0.655826 0.688414 0.012480
2 0.604902 -0.332028 0.470119 -0.370570
3 0.661120 0.635879 -0.441816 -0.847047
4 0.836218 2.597254 1.029996 0.554012
.. ... ... ... ...
49 1.948361 -0.801236 2.075301 -0.540771
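You can group by the row position integer-divided by the chunk size to get the chunks (here 5 rows per chunk):
for sub_df_index, sub_df in df.groupby(np.arange(len(df)) // 5):
    print(sub_df.head(10))
The first three chunks: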
A B C D
0 0.113454 3.357840 -0.413755 -1.089784
1 0.800012 0.655826 0.688414 0.012480
2 0.604902 -0.332028 0.470119 -0.370570
3 0.661120 0.635879 -0.441816 -0.847047
4 0.836218 2.597254 1.029996 0.554012
A B C D
5 -0.236094 1.714750 -0.091074 0.182944
6 0.928875 -1.125854 0.493389 0.309107
7 -0.238064 1.566493 -0.244627 0.744391
8 0.041049 0.423166 1.020502 -0.467028
9 0.290232 2.119993 -0.174697 0.784637
A B C D
10 -0.600395 0.604698 0.220617 2.122293
11 0.717157 -0.067665 -1.150331 -0.683567
12 1.006764 -0.869975 -1.646339 0.632909
13 0.076679 0.262971 0.687525 0.195338
14 -0.582238 0.236346 -0.903972 -0.223720
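At this point you don't actually need the new label column you proposed; however, if you insist on having it, just add it to each new sub_df.
for sub_df_index, sub_df in df.groupby(np.arange(len(df)) // 5):
    # assign returns a new DataFrame, so the original df is left untouched
    sub_df = sub_df.assign(sub_index=sub_df_index)
    print(sub_df.head(10))
Output: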
A B C D sub_index
0 0.113454 3.357840 -0.413755 -1.089784 0
1 0.800012 0.655826 0.688414 0.012480 0
2 0.604902 -0.332028 0.470119 -0.370570 0
3 0.661120 0.635879 -0.441816 -0.847047 0
4 0.836218 2.597254 1.029996 0.554012 0
A B C D sub_index
5 -0.236094 1.714750 -0.091074 0.182944 1
6 0.928875 -1.125854 0.493389 0.309107 1
7 -0.238064 1.566493 -0.244627 0.744391 1
8 0.041049 0.423166 1.020502 -0.467028 1
9 0.290232 2.119993 -0.174697 0.784637 1
A B C D sub_index
10 -0.600395 0.604698 0.220617 2.122293 2
11 0.717157 -0.067665 -1.150331 -0.683567 2
12 1.006764 -0.869975 -1.646339 0.632909 2
13 0.076679 0.262971 0.687525 0.195338 2
14 -0.582238 0.236346 -0.903972 -0.223720 2
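As a rough illustration of why the label column is optional: groupby accepts the array of chunk numbers directly, so the per-chunk aggregation can be done in one step. This is only a sketch; the seeded random frame, `chunk_size`, `chunk_ids` and `chunk_means` are names assumed here for the demo and are not part of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame: 50 rows of random numbers, like the one above.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((50, 4)), columns=list("ABCD"))

# Row position // chunk size gives the chunk number of every row.
chunk_size = 5
chunk_ids = np.arange(len(df)) // chunk_size

# groupby accepts the array directly, so no extra column is stored in df.
chunk_means = df.groupby(chunk_ids).mean()
print(chunk_means.head(3))
```

Each row of `chunk_means` is the column-wise mean of one chunk of 5 consecutive rows; any other aggregation (sum, count, agg with a dict) could be swapped in the same way.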
Edit: if you just need a single df, then do
df["sub_index"] = np.arange(len(df)) // 5
Output:
A B C D sub_index
0 -1.381390 0.523980 1.306372 0.000278 0
1 -0.425316 0.937133 0.627025 -0.439032 0
2 -0.443357 0.160292 0.450645 -0.366276 0
3 -2.222720 -1.768990 -0.067939 1.239722 0
4 2.039943 0.774243 0.108462 0.192314 0
5 -0.702514 -1.258634 -1.086802 1.151799 1
6 1.269017 1.115269 -0.417813 1.161220 1
7 -0.620205 -0.054393 0.431089 0.436805 1
8 -2.321976 -1.269446 0.927542 -0.069101 1
9 0.387243 0.055290 1.519623 -0.732410 1
10 -0.227690 -1.991782 -0.712146 0.003375 2
11 -1.396515 -0.074016 -1.141520 -0.226016 2
12 -0.430559 1.347512 -0.773859 1.016727 2
13 0.867294 0.924141 -0.484293 -0.666916 2
14 -0.224497 0.818024 1.057355 1.700363 2
15 -0.790723 -0.039521 1.529804 -0.415783 3
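Applied to the question, the same idea might look like the sketch below. It assumes a stand-in frame of 3,950 rows with shuffled index labels (the real `results3` is not available here), and the column names `bin_even` and `bin_qcut` are made up for the illustration. Because the bins are computed from `np.arange(len(...))`, they follow row position rather than the index labels, which matters when the index is not sequential, as in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's frame: 3,950 rows, shuffled index labels.
n_rows, n_bins = 3950, 10
rng = np.random.default_rng(0)
results3 = pd.DataFrame(
    {"48_prediction": rng.random(n_rows)},
    index=rng.permutation(n_rows),
)

# Position of each row in the current order (independent of the index labels).
pos = np.arange(len(results3))

# Option 1: fixed bin size of 400 rows; the last bin is simply shorter.
results3["bin_nu"] = pos // 400

# Option 2: fixed number of bins, with sizes as even as possible.
results3["bin_even"] = pos * n_bins // len(results3)

# Option 3: qcut on the positions (not on the data values) also bins by row count.
results3["bin_qcut"] = pd.qcut(pos, n_bins, labels=False)

# Later, in a separate step, aggregate by the stored bin number.
per_bin = results3.groupby("bin_nu")["48_prediction"].agg(["count", "mean"])
print(per_bin)
```

Labels here run 0-9; adding 1 to any of the bin columns gives 1-10 instead. Passing the positions (a 1-D numeric array) to pd.qcut sidesteps the value-based binning described in the question, since the positions are already evenly spaced.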