python pandas dataframe: need to speed up a computation involving 3 rows of data

Time: 2017-04-05 08:37:29

Tags: python pandas dataframe

My data looks like this:

Tran|Type|Amount|comment
1212|A|12|Buy
1212|AA|13|Buy
1212|CC|25|S
1213|AA|1112|B
1213|A|78|B
1213|CC|1190|SEllding
1214|AA|1112|B
1214|A|78|B
1214|CC|1190|SEllding
1215|AA|1112|B
1215|A|78|B
1216|AA|1112|B


....

I need to pick out every Tran that has all 3 types A, AA and CC, and for which A.Amount + AA.Amount = CC.Amount.

The data set is huge (100M records).

My code is below, but it runs very slowly:

import pandas

# keep only the transactions that have exactly three rows
df1=df.groupby("Tran").filter(lambda x: len(x) == 3)
listrefn=df1.Tran.tolist()
df1=df[df.Tran.isin(listrefn)]
# split the amounts by type
df2=df1[df1.Type=='A']
dfA=df2[['Tran','Amount']]
df2=df1[df1.Type=='AA']
dfAA=df2[['Tran','Amount']]
df2=df1[df1.Type=='CC']
dfCC=df2[['Tran','Amount']]

dfA=dfA.rename(columns={'Amount':'A'})
dfAA=dfAA.rename(columns={'Amount':'AA'})
dfCC=dfCC.rename(columns={'Amount':'CC'})

# join the three amounts side by side on Tran
dftmp=pandas.merge(dfA,dfAA,how='left')
dftmp1=pandas.merge(dftmp,dfCC,how='left')
# diff is zero exactly when A + AA = CC
dftmp1['diff']=dftmp1.A+dftmp1.AA-dftmp1.CC
dftmp=dftmp1[['Tran','diff']]
dftmp1=dftmp[dftmp['diff']==0]

Please advise.

3 answers:

Answer 0: (score: 3)

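You can use pivot together with query:

# Assumption: idx (not shown in the answer) holds the Tran values
# whose pivoted amounts satisfy A + AA == CC
idx = df.pivot(index='Tran', columns='Type', values='Amount').query('A + AA == CC').index
df = df.set_index('Tran').loc[idx].reset_index()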
print (df)
   Tran Type  Amount   comment
0  1212    A      12       Buy
1  1212   AA      13       Buy
2  1212   CC      25         S
3  1213   AA    1112         B
4  1213    A      78         B
5  1213   CC    1190  SEllding
6  1214   AA    1112         B
7  1214    A      78         B
8  1214   CC    1190  SEllding


Answer 1: (score: 3)

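Use set_index and unstack. The nice part is that A + AA == CC cannot hold unless all three types are present: a missing type becomes NaN after unstack, and any comparison involving NaN is False. So there is no need to check separately that all three exist.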

df.set_index(['Tran', 'Type']).Amount.unstack().query('A + AA == CC')

Type     A      AA      CC
Tran                      
1212  12.0    13.0    25.0
1213  78.0  1112.0  1190.0
1214  78.0  1112.0  1190.0
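
To illustrate why no separate "all three present" check is needed, here is a minimal sketch on a cut-down version of the sample data. The variable names wide and matches are chosen for illustration only and do not come from the answer.

```python
import pandas as pd

# Tran 1215 has no CC row, so unstack leaves NaN in that cell.
df = pd.DataFrame(
    {
        "Tran": [1212, 1212, 1212, 1215, 1215],
        "Type": ["A", "AA", "CC", "AA", "A"],
        "Amount": [12, 13, 25, 1112, 78],
    }
)

# One row per Tran, one column per Type.
wide = df.set_index(["Tran", "Type"])["Amount"].unstack()

# Comparing against NaN yields False, so the incomplete Tran drops out on its own.
matches = wide["A"] + wide["AA"] == wide["CC"]

print(wide)
print(matches)
```

The same reasoning applies when A or AA is the missing type, because NaN propagates through the addition before the comparison is made.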

You can get the original subset with:

t = df.set_index(['Tran', 'Type']).Amount.unstack().query('A + AA == CC').index
df.query("Tran in @t")
# equivalently
# df[df.Tran.isin(t)]

   Tran Type  Amount   comment
0  1212    A      12       Buy
1  1212   AA      13       Buy
2  1212   CC      25         S
3  1213   AA    1112         B
4  1213    A      78         B
5  1213   CC    1190  SEllding
6  1214   AA    1112         B
7  1214    A      78         B
8  1214   CC    1190  SEllding
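
One caveat worth sketching: unstack cannot reshape when a (Tran, Type) pair occurs more than once. The helper below is a hypothetical wrapper (the name balanced_rows and the duplicate check are assumptions, not part of the answer) that fails early in that case and otherwise follows the same set_index / unstack / isin steps.

```python
import pandas as pd


def balanced_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of df whose Tran satisfies A + AA == CC.

    Hypothetical helper; the name is not from the original answer.
    """
    # unstack raises on duplicate (Tran, Type) pairs, so check first
    # and report the problem with a clearer message.
    if df.duplicated(["Tran", "Type"]).any():
        raise ValueError("duplicate (Tran, Type) pairs; aggregate them first")

    wide = df.set_index(["Tran", "Type"])["Amount"].unstack()
    # Make sure all three columns exist even if a type never occurs at all.
    wide = wide.reindex(columns=["A", "AA", "CC"])
    mask = wide["A"] + wide["AA"] == wide["CC"]
    keep = wide.loc[mask].index
    return df[df["Tran"].isin(keep)]


df = pd.DataFrame(
    {
        "Tran": [1212, 1212, 1212, 1213, 1213, 1213, 1215, 1215, 1216],
        "Type": ["A", "AA", "CC", "AA", "A", "CC", "AA", "A", "AA"],
        "Amount": [12, 13, 25, 1112, 78, 1190, 1112, 78, 1112],
        "comment": ["Buy", "Buy", "S", "B", "B", "SEllding", "B", "B", "B"],
    }
)

result = balanced_rows(df)
print(result)
```

Whether duplicates can occur in the real 100M-row data is not stated in the question, so the check may be unnecessary there.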

Answer 2: (score: 2)

UPDATE: looking at @piRSquared's perfect solution, I realized that we don't need to pre-filter the source DF.

So this should be enough:

In [23]: x = df.groupby("Tran").filter(lambda x: len(x) == 3)

In [24]: x
Out[24]:
   Tran Type  Amount   comment
0  1212    A      12       Buy
1  1212   AA      13       Buy
2  1212   CC      25         S
3  1213   AA    1112         B
4  1213    A      78         B
5  1213   CC    1190  SEllding
6  1214   AA    1112         B
7  1214    A      78         B
8  1214   CC    1190  SEllding

In [25]: x.pivot(index='Tran', columns='Type', values='Amount').query('A + AA == CC')
Out[25]:
Type   A    AA    CC
Tran
1212  12    13    25
1213  78  1112  1190
1214  78  1112  1190
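
The point of the update, that pre-filtering is not required, can be sketched by pivoting the unfiltered frame directly. The second half is an assumption beyond the answer: if the three-row check is still wanted, a vectorized group size via transform may avoid calling a Python lambda once per group. The names direct, size and prefiltered are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Tran": [1212, 1212, 1212, 1213, 1213, 1213,
                 1214, 1214, 1214, 1215, 1215, 1216],
        "Type": ["A", "AA", "CC", "AA", "A", "CC",
                 "AA", "A", "CC", "AA", "A", "AA"],
        "Amount": [12, 13, 25, 1112, 78, 1190,
                   1112, 78, 1190, 1112, 78, 1112],
    }
)

# Pivot without pre-filtering: Trans lacking a type get NaN and fail the query.
direct = df.pivot(index="Tran", columns="Type", values="Amount").query("A + AA == CC")

# Optional three-row check, computed without a per-group Python callback.
size = df.groupby("Tran")["Tran"].transform("size")
prefiltered = (
    df[size == 3]
    .pivot(index="Tran", columns="Type", values="Amount")
    .query("A + AA == CC")
)

print(direct.index.tolist())
print(prefiltered.index.tolist())
```

Both variants select the same Tran values on this sample; only the dtype of the pivoted amounts differs, because the unfiltered pivot contains NaN.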
