Get the top rows in each group that share the same column value

Time: 2016-05-04 10:44:38

Tags: python pandas dataframe

I have a table that looks like this:

Column 1 | Column 2 | Column 3
   1           a         100
   1           r         100
   1           h         200
   1           j         200
   2           a         50
   2           q         50
   2           k         40
   3           a         10
   3           q         150
   3           k         150

Imagine I'm trying to get the top rows for each group (grouped by 'Column 1').

Normally I would just use .head(n), but in this case I only want the top rows of each group that share the same Column 3 value. For group 1 that means the two rows with 100, for group 2 the two rows with 50, and for group 3 the single row with 10.

Assume the table is already in the order I want.

Any suggestions would be greatly appreciated.

2 Answers:

Answer 0: (score: 1)

I think you need groupby with first, then merge:

pd.merge(df, df.groupby('Column 1')['Column 3'].first().reset_index(), on=['Column 1','Column 3'])

Timings:

df = pd.concat([df]*1000).reset_index(drop=True)

%timeit pd.merge(df, df.groupby('Column 1')['Column 3'].first().reset_index(), on=['Column 1','Column 3'])
100 loops, best of 3: 3.58 ms per loop

%timeit df[(df.assign(diff=df.groupby('Column 1')['Column 3'].diff().fillna(0)).groupby('Column 1')['diff'].cumsum() == 0)]
100 loops, best of 3: 5.06 ms per loop
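For readers who want to try the groupby-first-then-merge idea from this answer, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the table in the question, and the variable names `first_values` and `result` are only illustrative. Note that the inner merge produces a fresh 0..n-1 index rather than keeping the original row labels.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "Column 1": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Column 2": list("arhjaqkaqk"),
    "Column 3": [100, 100, 200, 200, 50, 50, 40, 10, 150, 150],
})

# One row per group: the Column 3 value of the group's first row.
first_values = df.groupby("Column 1")["Column 3"].first().reset_index()

# The inner merge keeps only rows whose (Column 1, Column 3) pair
# matches the first value of their group.
result = pd.merge(df, first_values, on=["Column 1", "Column 3"])
print(result)
```

One thing to keep in mind: the merge keeps every row in a group whose Column 3 equals the group's first value, even if that value reappears later in the group after a different one.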

Answer 1: (score: 0)

My solution (without a merge):

In [83]: idx = (df.assign(diff=df.groupby('Column1')['Column3'].diff().fillna(0))
   ....:          .groupby('Column1')['diff'].cumsum() == 0
   ....:       )

In [84]: df[idx]
Out[84]:
   Column1 Column2  Column3
0        1       a      100
1        1       r      100
4        2       a       50
5        2       q       50
7        3       a       10

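Explanation: the per-group diff is 0 while Column3 is unchanged (the NaN on each group's first row is filled with 0), so its cumulative sum is 0 exactly on the rows whose value equals the group's first value: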

In [85]: df.assign(diff=df.groupby('Column1')['Column3'].diff().fillna(0))
Out[85]:
   Column1 Column2  Column3   diff
0        1       a      100    0.0
1        1       r      100    0.0
2        1       h      200  100.0
3        1       j      200    0.0
4        2       a       50    0.0
5        2       q       50    0.0
6        2       k       40  -10.0
7        3       a       10    0.0
8        3       q      150  140.0
9        3       k      150    0.0

In [86]: df.assign(diff=df.groupby('Column1')['Column3'].diff().fillna(0)).groupby('Column1')['diff'].cumsum()
Out[86]:
0      0.0
1      0.0
2    100.0
3    100.0
4      0.0
5      0.0
6    -10.0
7      0.0
8    140.0
9    140.0
Name: diff, dtype: float64
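As a side note, the cumulative sum of within-group differences telescopes to "current value minus the group's first value", so the mask above should match a direct comparison against `transform('first')`. The sketch below checks this on the sample data. It is an alternative formulation rather than the answer author's code, and the names `by_cumsum`, `by_first`, and `top_rows` are made up for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question, using this answer's column names.
df = pd.DataFrame({
    "Column1": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Column2": list("arhjaqkaqk"),
    "Column3": [100, 100, 200, 200, 50, 50, 40, 10, 150, 150],
})

# Mask from the answer: cumulative sum of per-group diffs equals 0.
by_cumsum = (
    df.assign(diff=df.groupby("Column1")["Column3"].diff().fillna(0))
      .groupby("Column1")["diff"].cumsum() == 0
)

# Direct formulation: compare each value with its group's first value.
by_first = df["Column3"] == df.groupby("Column1")["Column3"].transform("first")

# Both masks select the same rows on this data.
print((by_cumsum == by_first).all())

top_rows = df[by_first]
print(top_rows)
```

Unlike the merge approach, boolean indexing keeps the original index labels. Both masks also keep a row if the first value reappears later in its group, which may or may not be what is wanted when equal values are not contiguous.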