My data looks like this:
Tran|Type|Amount|comment
1212|A|12|Buy
1212|AA|13|Buy
1212|CC|25|S
1213|AA|1112|B
1213|A|78|B
1213|CC|1190|SEllding
1214|AA|1112|B
1214|A|78|B
1214|CC|1190|SEllding
1215|AA|1112|B
1215|A|78|B
1216|AA|1112|B
....
I need to select every Tran that has all 3 types A, AA, CC and for which A.Amount + AA.Amount = CC.Amount.
The data is huge (100M records).
My code is below, but it runs very slowly:
import pandas

# Keep only transactions that have exactly three rows
df1=df.groupby("Tran").filter(lambda x: len(x) == 3)
listrefn=df1.Tran.tolist()
df1=df[df.Tran.isin(listrefn)]
# Split out the Amount of each Type into its own frame
df2=df1[df1.Type=='A']
dfA=df2[['Tran','Amount']]
df2=df1[df1.Type=='AA']
dfAA=df2[['Tran','Amount']]
df2=df1[df1.Type=='CC']
dfCC=df2[['Tran','Amount']]
dfA=dfA.rename(columns={'Tran':'Tran','Amount':'A'})
dfAA=dfAA.rename(columns={'Tran':'Tran','Amount':'AA'})
dfCC=dfCC.rename(columns={'Tran':'Tran','Amount':'CC'})
# Join the three amounts on Tran and keep rows where A + AA - CC == 0
dftmp=pandas.merge(dfA,dfAA,how='left')
dftmp1=pandas.merge(dftmp,dfCC,how='left')
dftmp1['diff']=dftmp1.A+dftmp1.AA-dftmp1.CC
dftmp=dftmp1[['Tran','diff']]
dftmp1=dftmp[dftmp['diff']==0]
Please advise.
Answer 0 (score: 3)
# idx is assumed to hold the Tran values for which A + AA == CC
df = df.set_index('Tran').loc[idx].reset_index()
print(df)
   Tran Type  Amount   comment
0  1212    A      12       Buy
1  1212   AA      13       Buy
2  1212   CC      25         S
3  1213   AA    1112         B
4  1213    A      78         B
5  1213   CC    1190  SEllding
6  1214   AA    1112         B
7  1214    A      78         B
8  1214   CC    1190  SEllding
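The snippet above uses idx without showing how it is built. A minimal sketch of one way to compute it, assuming idx is meant to be the Tran values that satisfy A + AA == CC (the names wide and result are illustrative, not taken from the answer):

```python
import io
import pandas as pd

raw = """Tran|Type|Amount|comment
1212|A|12|Buy
1212|AA|13|Buy
1212|CC|25|S
1213|AA|1112|B
1213|A|78|B
1213|CC|1190|SEllding
1214|AA|1112|B
1214|A|78|B
1214|CC|1190|SEllding
1215|AA|1112|B
1215|A|78|B
1216|AA|1112|B
"""
df = pd.read_csv(io.StringIO(raw), sep="|")

# One row per Tran, one column per Type; a missing Type becomes NaN.
wide = df.groupby(["Tran", "Type"])["Amount"].sum().unstack()

# Assumed definition of idx: Tran values where A + AA equals CC.
idx = wide[wide["A"] + wide["AA"] == wide["CC"]].index

# Select the original rows for those Tran values.
result = df.set_index("Tran").loc[idx].reset_index()
print(result)
```

Summing per (Tran, Type) before unstack also tolerates repeated pairs, which a plain unstack would reject.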
Another solution for filtering:
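One way such a filter can be written, as a minimal sketch rather than the answer's original code: aggregate Amount per Tran and Type with pivot_table, then select the original rows with a boolean isin mask. The names wide, ok and result are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Tran": [1212, 1212, 1212, 1213, 1213, 1213, 1215, 1215, 1216],
    "Type": ["A", "AA", "CC", "AA", "A", "CC", "AA", "A", "AA"],
    "Amount": [12, 13, 25, 1112, 78, 1190, 1112, 78, 1112],
})

# Wide table of summed amounts; Types missing for a Tran show up as NaN.
wide = df.pivot_table(index="Tran", columns="Type", values="Amount", aggfunc="sum")

# Tran values whose amounts satisfy A + AA == CC.
ok = wide.index[wide["A"] + wide["AA"] == wide["CC"]]

# Boolean mask on the original frame keeps row order and the original index.
result = df[df["Tran"].isin(ok)]
print(result)
```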
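Answer 1 (score: 3)
Use set_index with unstack and query. The nice thing is that A + AA == CC cannot be true unless all three types are present (a missing type becomes NaN after unstack, and any comparison involving NaN is False), so there is no need to check separately that all three are there.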
df.set_index(['Tran', 'Type']).Amount.unstack().query('A + AA == CC')
Type     A      AA      CC
Tran
1212  12.0    13.0    25.0
1213  78.0  1112.0  1190.0
1214  78.0  1112.0  1190.0
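The reason no presence check is needed can be seen on a tiny frame. This is a minimal sketch with made-up variable names (wide, matches), using one complete Tran and one that lacks CC:

```python
import pandas as pd

df = pd.DataFrame({
    "Tran": [1212, 1212, 1212, 1215, 1215],
    "Type": ["A", "AA", "CC", "AA", "A"],
    "Amount": [12, 13, 25, 1112, 78],
})

# Tran 1215 has no CC row, so its CC cell is NaN after unstack.
wide = df.set_index(["Tran", "Type"])["Amount"].unstack()

# Comparing against NaN yields False, so 1215 drops out on its own.
matches = wide["A"] + wide["AA"] == wide["CC"]
print(wide)
print(matches)
```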
You can get the matching subset of the original rows with:
t = df.set_index(['Tran', 'Type']).Amount.unstack().query('A + AA == CC').index
df.query("Tran in @t")
# equivalently
# df[df.Tran.isin(t)]
   Tran Type  Amount   comment
0  1212    A      12       Buy
1  1212   AA      13       Buy
2  1212   CC      25         S
3  1213   AA    1112         B
4  1213    A      78         B
5  1213   CC    1190  SEllding
6  1214   AA    1112         B
7  1214    A      78         B
8  1214   CC    1190  SEllding
Answer 2 (score: 2)
Update: looking at @piRSquared's perfect solution, I realized that we don't need to filter the source DF beforehand, so the pivot and query step on its own should be enough.
OLD answer:
In [23]: x = df.groupby("Tran").filter(lambda x: len(x) == 3)
In [24]: x
Out[24]:
   Tran Type  Amount   comment
0  1212    A      12       Buy
1  1212   AA      13       Buy
2  1212   CC      25         S
3  1213   AA    1112         B
4  1213    A      78         B
5  1213   CC    1190  SEllding
6  1214   AA    1112         B
7  1214    A      78         B
8  1214   CC    1190  SEllding
In [25]: x.pivot(index='Tran', columns='Type', values='Amount').query('A + AA == CC')
Out[25]:
Type   A    AA    CC
Tran
1212  12    13    25
1213  78  1112  1190
1214  78  1112  1190
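The update's claim that pre-filtering is unnecessary can be checked with a small sketch. It assumes each (Tran, Type) pair occurs at most once, since pivot raises on duplicates; the sample frame and the names with_filter and without_filter are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Tran": [1212, 1212, 1212, 1213, 1213, 1213, 1215, 1215, 1216],
    "Type": ["A", "AA", "CC", "AA", "A", "CC", "AA", "A", "AA"],
    "Amount": [12, 13, 25, 1112, 78, 1190, 1112, 78, 1112],
})

# Old approach: drop incomplete transactions first, then pivot and query.
x = df.groupby("Tran").filter(lambda g: len(g) == 3)
with_filter = x.pivot(index="Tran", columns="Type", values="Amount").query("A + AA == CC")

# Updated approach: pivot the unfiltered frame; incomplete rows hold NaN
# and fail the comparison inside query.
without_filter = df.pivot(index="Tran", columns="Type", values="Amount").query("A + AA == CC")

print(list(with_filter.index))
print(list(without_filter.index))
```

Skipping the groupby filter avoids calling a Python lambda once per group, which is likely the expensive part on a frame with many distinct Tran values.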