I have a CSV file that looks like this:
tid || instr_count || fnname
=============================
22 || 892806 || main
22 || 18 || randlc
22 || 120 || makea
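I want to merge the instr_count values together depending on whether fnname appears in a given list: rows whose fnname is not in the list should be added into the next row whose fnname is. For example, if my list is ['main', 'makea'], the final table should look like this: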
tid || instr_count || fnname
=============================
22 || 892806 || main
22 || 138 || makea
I don't know in advance how many entries lie between two values from the given list, so the data could look more like this:
tid || instr_count || fnname
=============================
22 || 892806 || main
22 || 18 || randlc
22 || 7 || randlc
22 || 35 || randlc
22 || 20 || randlc
22 || 120 || makea
which should be compressed to:
tid || instr_count || fnname
=============================
22 || 892806 || main
22 || 200 || makea
I am using pandas 0.17.1 and Python 2.7.6 to load these values into a DataFrame. This is what I have so far:
def compressDataframes(df):
    new_df = pd.DataFrame(columns=df.columns)
    instr_count = 0
    i = 0
    for row in df.itertuples():
        # row[0] is the index, so row[1..3] are tid, instr_count, fnname
        instr_count += row[2]
        if any(f in row[3] for f in FUNCS):  # FUNCS is my "given list"
            new_df.loc[i] = [row[1], instr_count, row[3]]
            i += 1
            instr_count = 0
    return new_df
This works, but I suspect there must be a faster way to do it (I am working with some very large (> 10 GB) datasets). Does anyone have any suggestions?
Answer 0 (score: 1)
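I think you can use isin with boolean indexing to create a new column, grouped, that holds NaN for every row whose fnname is not in the list, and then use fillna to fill each gap with the next valid observation (backfill). Finally, group by the grouped column and sum the instr_count column:
li = ['main','makea']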
df['grouped'] = df.loc[df['fnname'].isin(li), 'fnname']
df['grouped'] = df['grouped'].fillna(method='bfill')
print df
   tid  instr_count  fnname  grouped
0   22       892806    main     main
1   22           18  randlc    makea
2   22          120   makea    makea
print df.groupby(['tid','grouped'])['instr_count'].sum().reset_index()
   tid  grouped  instr_count
0   22     main       892806
1   22    makea          138
Alternatively, with agg:
print df.groupby('grouped').agg({'tid':'first', 'instr_count': sum}).reset_index()
   grouped  tid  instr_count
0     main   22       892806
1    makea   22          138
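For readers on Python 3 and a recent pandas, the steps above can be sketched as one self-contained script. This is only a sketch under assumptions: it swaps in `Series.where` and `Series.bfill` for the `.loc` assignment and `fillna(method='bfill')` used in the answer, and the data is the three-row sample from the question.

```python
import pandas as pd

# Three-row sample copied from the question.
df = pd.DataFrame({
    "tid": [22, 22, 22],
    "instr_count": [892806, 18, 120],
    "fnname": ["main", "randlc", "makea"],
})

li = ["main", "makea"]  # the "given list"

# Keep fnname where it is in the list and NaN elsewhere, then backfill so
# every unlisted row takes the name of the next listed row.
df["grouped"] = df["fnname"].where(df["fnname"].isin(li)).bfill()

# sort=False keeps the groups in order of first appearance.
result = (
    df.groupby(["tid", "grouped"], sort=False)["instr_count"]
      .sum()
      .reset_index()
)
print(result)
```

Note that `isin` matches names exactly, whereas the loop in the question uses a substring test (`f in row[3]`), so the two can differ if a listed name only appears as part of a longer fnname.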
The same steps applied to the longer sample:
li = ['main','makea']
df['grouped'] = df.loc[df['fnname'].isin(li), 'fnname']
df['grouped'] = df['grouped'].fillna(method='bfill')
print df
   tid  instr_count  fnname  grouped
0   22       892806    main     main
1   22           18  randlc    makea
2   22            7  randlc    makea
3   22           35  randlc    makea
4   22           20  randlc    makea
5   22          120   makea    makea
print df.groupby(['tid','grouped'])['instr_count'].sum().reset_index()
   tid  grouped  instr_count
0   22     main       892806
1   22    makea          200
print df.groupby('grouped').agg({'tid':'first', 'instr_count': sum}).reset_index()
   grouped  tid  instr_count
0     main   22       892806
1    makea   22          200
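One caveat worth sketching: grouping on the backfilled name merges every segment that ends in the same function, even when the segments are not adjacent, while the loop in the question emits one row per segment. A possible variant, not part of the answer above, groups on a running block id instead. The input below is the question's second sample with two hypothetical rows appended so that main occurs twice; the column name `block` is also just an illustrative choice.

```python
import pandas as pd

li = ["main", "makea"]  # the "given list"

# Second sample from the question plus two made-up trailing rows
# (randlc 5, main 10) so that "main" closes two separate segments.
df = pd.DataFrame({
    "tid": [22] * 8,
    "instr_count": [892806, 18, 7, 35, 20, 120, 5, 10],
    "fnname": ["main", "randlc", "randlc", "randlc", "randlc",
               "makea", "randlc", "main"],
})

is_end = df["fnname"].isin(li)

# Number of listed rows seen strictly before each row. A listed row
# therefore shares its id with the unlisted rows that precede it.
block = (is_end.cumsum() - is_end.astype(int)).rename("block")

result = (
    df.groupby(block, sort=False)
      .agg({"tid": "first", "instr_count": "sum", "fnname": "last"})
      .reset_index(drop=True)
)

# Drop a trailing block that never reached a listed function,
# mirroring the loop in the question, which never emits it.
result = result[result["fnname"].isin(li)].reset_index(drop=True)
print(result)
```

On this input the name-based grouping would fold both main segments into a single row of 892821, whereas the block id keeps them as separate rows.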
Second sample:
#include <iostream>
#include <queue.h>
#include <queue.cpp>
#include <poli.h>
#include <poli.cpp>
#include <Stack.h>
#include <Stack.cpp>
using namespace std;

int main()
{
    poli<term> p1(3);
    poli<term> p2(4);
    cout << "Polinomul p1:" << endl;
    p1.read();
    cout << endl;
    cout << "Polinomul p2:" << endl;
    p2.read();
    p1.sort();
    p2.sort();
    cout << "Valoare in punctul x=3 a polinomului p1 este:" << p1.val_pol(3) << endl;
    cout << "Valoare in punctul x=3 a polinomului p2 este:" << p2.val_pol(3) << endl;
    p1.display();
    p1.invert();
    cout << endl;
    p1.display();
    cout << endl;
    p2.display();
    p2.invert();
    cout << endl;
    p2.display();
    Stack<term> p3;
    return 0;
}