Pandas DataFrame: binning by row count

Time: 2018-03-13 19:02:21

Tags: python pandas

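I have a pandas DF with (at the moment) about 4000 rows, which I have put in the order I need......

I need to split the DF into 10 arbitrary bins by row count, so I want to attach a bin number to each row for use in later per-bin aggregate calculations. Each bin would hold about 400 rows, since the last bin won't be completely filled.

I don't want to do everything in a single operation right now; I just need the bin number attached to each row, so that I can do the group-by aggregation in a separate step.

I have tried pd.qcut and pd.cut, but without any luck.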

      ccs_category  sex  race  marital_status  presence_of_child  payer_type  no_of_persons  occupation  education  wealth_rate  donate_charity      zip  age  48_prediction  count  cum_count  cum_accurate  cum_percent
 263           218    M     U               S                  N          PK            999          99          0           99               U  60657.0    0       0.000538      1          1             0          0.0
2452           250    M     W               U                  U          NK              0          99          0           99               U   8730.0   29       0.000404      1          2             0          0.0
2814           127    F     W               U                  N          MK              2           8          2            9               Y  53051.0   75       0.000369      1          3             0          0.0

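So I would like the last column to be bin_nu, with sequential labels 1-10 (0-9 would be fine too). E.g. the first 400 rows get label 1, the second 400 rows get label 2, and so on.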

results3['bin_nu'] = pd.cut(results3, 10, labels=False)

I also tried retbins=True, but to no avail. It gave me:

TypeError: '<=' not supported between instances of 'str' and 'int'

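I have looked around, but everything seems to bin by value (or by time series) rather than by an arbitrary item count?

Here are the first ten rows of the DF converted to a dict: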

{'48_prediction': {263: 0.0005377883207984269,
  578: 0.00030468139448203146,
  1168: 0.0003036215784959495,
  2021: 0.0003635243338067085,
  2452: 0.0004036910831928253,
  2518: 0.00029300281312316656,
  2814: 0.00036939769051969046,
  3687: 0.0003046905330847949,
  3963: 0.00029347193776629865,
  7132: 0.00035702803870663047},
 'age': {263: 0,
  578: 68,
  1168: 1,
  2021: 42,
  2452: 29,
  2518: 92,
  2814: 75,
  3687: 40,
  3963: 51,
  7132: 34},
 'ccs_category': {263: '218',
  578: '250',
  1168: '166',
  2021: '90',
  2452: '250',
  2518: '3',
  2814: '127',
  3687: '58',
  3963: '160',
  7132: '196'},
 'count': {263: 1,
  578: 1,
  1168: 1,
  2021: 1,
  2452: 1,
  2518: 1,
  2814: 1,
  3687: 1,
  3963: 1,
  7132: 1},
 'cum_accurate': {263: 0,
  578: 0,
  1168: 0,
  2021: 0,
  2452: 0,
  2518: 0,
  2814: 0,
  3687: 0,
  3963: 0,
  7132: 0},
 'cum_count': {263: 1,
  578: 7,
  1168: 8,
  2021: 4,
  2452: 2,
  2518: 10,
  2814: 3,
  3687: 6,
  3963: 9,
  7132: 5},
 'cum_percent': {263: 0.0,
  578: 0.0,
  1168: 0.0,
  2021: 0.0,
  2452: 0.0,
  2518: 0.0,
  2814: 0.0,
  3687: 0.0,
  3963: 0.0,
  7132: 0.0},
 'donate_charity': {263: 'U',
  578: 'U',
  1168: 'Y',
  2021: 'U',
  2452: 'U',
  2518: 'U',
  2814: 'Y',
  3687: 'U',
  3963: 'U',
  7132: 'U'},
 'education': {263: 0,
  578: 0,
  1168: 0,
  2021: 1,
  2452: 0,
  2518: 0,
  2814: 2,
  3687: 0,
  3963: 4,
  7132: 0},
 'marital_status': {263: 'S',
  578: 'U',
  1168: 'U',
  2021: 'M',
  2452: 'U',
  2518: 'U',
  2814: 'U',
  3687: 'U',
  3963: 'M',
  7132: 'U'},
 'no_of_persons': {263: 999,
  578: 0,
  1168: 2,
  2021: 2,
  2452: 0,
  2518: 0,
  2814: 2,
  3687: 0,
  3963: 3,
  7132: 0},
 'occupation': {263: '99',
  578: '99',
  1168: '99',
  2021: '0',
  2452: '99',
  2518: '99',
  2814: '8',
  3687: '99',
  3963: '0',
  7132: '99'},
 'payer_type': {263: 'PK',
  578: 'MI',
  1168: 'PK',
  2021: 'NK',
  2452: 'NK',
  2518: 'MK',
  2814: 'MK',
  3687: 'PK',
  3963: 'PK',
  7132: 'NK'},
 'presence_of_child': {263: 'N',
  578: 'U',
  1168: 'Y',
  2021: 'Y',
  2452: 'U',
  2518: 'U',
  2814: 'N',
  3687: 'Y',
  3963: 'N',
  7132: 'U'},
 'race': {263: 'U',
  578: 'B',
  1168: 'W',
  2021: 'W',
  2452: 'W',
  2518: 'W',
  2814: 'W',
  3687: 'B',
  3963: 'W',
  7132: 'A'},
 'sex': {263: 'M',
  578: 'F',
  1168: 'M',
  2021: 'M',
  2452: 'M',
  2518: 'M',
  2814: 'F',
  3687: 'F',
  3963: 'F',
  7132: 'F'},
 'wealth_rate': {263: '99',
  578: '7',
  1168: '8',
  2021: '6',
  2452: '99',
  2518: '99',
  2814: '9',
  3687: '0',
  3963: '6',
  7132: '8'},
 'zip': {263: 60657.0,
  578: 76021.0,
  1168: 85711.0,
  2021: 7747.0,
  2452: 8730.0,
  2518: 30680.0,
  2814: 53051.0,
  3687: 7740.0,
  3963: 62025.0,
  7132: 19082.0}}

1 answer:

Answer 0 (score: 1)

Let's start with a random dataframe with 50 rows:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(50, 4), columns=list("ABCD"))

           A         B         C         D
0   0.113454  3.357840 -0.413755 -1.089784
1   0.800012  0.655826  0.688414  0.012480
2   0.604902 -0.332028  0.470119 -0.370570
3   0.661120  0.635879 -0.441816 -0.847047
4   0.836218  2.597254  1.029996  0.554012
..       ...       ...       ...       ...
49  1.948361 -0.801236  2.075301 -0.540771

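You can use groupby on the row position to bin it and get the chunks:

# integer-divide each row position by 5: rows 0-4 form chunk 0, rows 5-9 chunk 1, ...
for sub_df_index, sub_df in df.groupby(np.arange(len(df)) // 5):
    print(sub_df.head(10))

The first three chunks: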

          A         B         C         D
0  0.113454  3.357840 -0.413755 -1.089784
1  0.800012  0.655826  0.688414  0.012480
2  0.604902 -0.332028  0.470119 -0.370570
3  0.661120  0.635879 -0.441816 -0.847047
4  0.836218  2.597254  1.029996  0.554012
          A         B         C         D
5 -0.236094  1.714750 -0.091074  0.182944
6  0.928875 -1.125854  0.493389  0.309107
7 -0.238064  1.566493 -0.244627  0.744391
8  0.041049  0.423166  1.020502 -0.467028
9  0.290232  2.119993 -0.174697  0.784637
           A         B         C         D
10 -0.600395  0.604698  0.220617  2.122293
11  0.717157 -0.067665 -1.150331 -0.683567
12  1.006764 -0.869975 -1.646339  0.632909
13  0.076679  0.262971  0.687525  0.195338
14 -0.582238  0.236346 -0.903972 -0.223720
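
Since the question mentions aggregating per bin in a later step, the same positional key can also be passed straight to an aggregation. The following is only a minimal sketch under assumptions: the seeded generator, the `key` variable, and the choice of `mean` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same shape as the example frame; the seed is only there for repeatability.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((50, 4)), columns=list("ABCD"))

# Positional key: rows 0-4 -> 0, rows 5-9 -> 1, and so on.
key = np.arange(len(df)) // 5

# One row per chunk, holding the per-chunk mean of every column.
chunk_means = df.groupby(key).mean()
print(chunk_means.head(3))
```

Because the key is built from row positions, it works regardless of what the index labels are.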

Now you don't need the new label column you proposed; however, if you insist on having it, just add it to each new sub_df:

for sub_df_index, sub_df in df.groupby(np.arange(len(df)) // 5):
    # assign() returns a labelled copy, which avoids a possible SettingWithCopyWarning
    sub_df = sub_df.assign(sub_index=sub_df_index)
    print(sub_df.head(10))

Output:

          A         B         C         D  sub_index
0  0.113454  3.357840 -0.413755 -1.089784          0
1  0.800012  0.655826  0.688414  0.012480          0
2  0.604902 -0.332028  0.470119 -0.370570          0
3  0.661120  0.635879 -0.441816 -0.847047          0
4  0.836218  2.597254  1.029996  0.554012          0
          A         B         C         D  sub_index
5 -0.236094  1.714750 -0.091074  0.182944          1
6  0.928875 -1.125854  0.493389  0.309107          1
7 -0.238064  1.566493 -0.244627  0.744391          1
8  0.041049  0.423166  1.020502 -0.467028          1
9  0.290232  2.119993 -0.174697  0.784637          1
           A         B         C         D  sub_index
10 -0.600395  0.604698  0.220617  2.122293          2
11  0.717157 -0.067665 -1.150331 -0.683567          2
12  1.006764 -0.869975 -1.646339  0.632909          2
13  0.076679  0.262971  0.687525  0.195338          2
14 -0.582238  0.236346 -0.903972 -0.223720          2
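
Note that a column added to a chunk inside the loop lives only on that chunk; df itself does not gain it. If the labelled chunks are needed afterwards, one option is to collect them and concatenate. This is a sketch, and the names `labelled_chunks` and `labelled` are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((50, 4)), columns=list("ABCD"))

labelled_chunks = []
for sub_df_index, sub_df in df.groupby(np.arange(len(df)) // 5):
    # assign() builds a new frame with the extra column; df is left untouched
    labelled_chunks.append(sub_df.assign(sub_index=sub_df_index))

# Stitch the labelled chunks back together in their original row order.
labelled = pd.concat(labelled_chunks)

print("sub_index" in df.columns)   # the original frame has no such column
print(labelled.head(7))
```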

Edit: if you just need a single df, then

df["sub_index"] = np.arange(len(df)) // 5

Output:

           A         B         C         D  sub_index
0  -1.381390  0.523980  1.306372  0.000278          0
1  -0.425316  0.937133  0.627025 -0.439032          0
2  -0.443357  0.160292  0.450645 -0.366276          0
3  -2.222720 -1.768990 -0.067939  1.239722          0
4   2.039943  0.774243  0.108462  0.192314          0
5  -0.702514 -1.258634 -1.086802  1.151799          1
6   1.269017  1.115269 -0.417813  1.161220          1
7  -0.620205 -0.054393  0.431089  0.436805          1
8  -2.321976 -1.269446  0.927542 -0.069101          1
9   0.387243  0.055290  1.519623 -0.732410          1
10 -0.227690 -1.991782 -0.712146  0.003375          2
11 -1.396515 -0.074016 -1.141520 -0.226016          2
12 -0.430559  1.347512 -0.773859  1.016727          2
13  0.867294  0.924141 -0.484293 -0.666916          2
14 -0.224497  0.818024  1.057355  1.700363          2
15 -0.790723 -0.039521  1.529804 -0.415783          3
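
Applied to the question's case (10 bins over roughly 4000 ordered rows), the same idea might look like the sketch below. The frame here is a hypothetical stand-in with made-up values; only the names results3, bin_nu and 48_prediction come from the question. Scaling the row position by the number of bins yields labels 0-9 for any row count, and pd.qcut gives the same labels here when it is fed the positions rather than the frame itself.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's frame: 4000 rows, non-sequential index.
n_rows, n_bins = 4000, 10
rng = np.random.default_rng(0)
results3 = pd.DataFrame(
    {"age": rng.integers(0, 100, n_rows), "48_prediction": rng.random(n_rows)},
    index=rng.permutation(n_rows),
)

# Bin on row position, not on index labels or on any column's values.
position = np.arange(len(results3))
results3["bin_nu"] = position * n_bins // len(results3)   # labels 0-9

# pd.qcut over the positions (not over the DataFrame) is an alternative.
via_qcut = pd.qcut(position, n_bins, labels=False)

# The later aggregation step, done separately on the stored label.
summary = results3.groupby("bin_nu")["48_prediction"].agg(["size", "mean"])
print(summary)
```

Add 1 to bin_nu if labels 1-10 are preferred. If a fixed bin size with a shorter last bin is wanted instead, `position // 400` follows the pattern shown in the answer.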