Question

I have a CSV file that looks like this:

tid ||  instr_count || fnname
=============================
22  ||      892806  || main
22  ||          18  || randlc
22  ||         120  || makea

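I want to merge the instr_count values together based on whether fnname appears in a given list. For example, if my list is ['main', 'makea'], the final table should look like this: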

tid ||  instr_count || fnname
=============================
22  ||      892806  || main
22  ||         138  || makea

I don't know in advance how many entries there are between two values from the given list, so it could look more like this:

tid ||  instr_count || fnname
=============================
22  ||      892806  || main
22  ||          18  || randlc
22  ||           7  || randlc
22  ||          35  || randlc
22  ||          20  || randlc
22  ||         120  || makea

That should be compressed to:

tid ||  instr_count || fnname
=============================
22  ||      892806  || main
22  ||         200  || makea

I am loading these values into a DataFrame using pandas 0.17.1 and Python 2.7.6. This is what I have so far:

def compressDataframes(df):

    new_df = pd.DataFrame(columns=df.columns)
    instr_count = 0
    i = 0
    for row in df.itertuples():
        instr_count += row[2]
        if any(f in row[3] for f in FUNCS): #FUNCS is my "given list"
            new_df.loc[i] = [row[1], instr_count, row[3]]
            i += 1
            instr_count = 0

    return new_df

This works, but I suspect there must be a faster way to do it (I am working with some very large (> 10 GB) datasets). Does anyone have any suggestions?

Answer 1

I think you can use isin with boolean indexing to create a new column, grouped, which at first holds NaN in every row whose fnname is not in the list. Then use fillna with method='bfill' to fill those gaps backwards from the next valid value. Finally, groupby the grouped column and sum the instr_count column:

li = ['main','makea']
df['grouped'] = df.loc[df['fnname'].isin(li), 'fnname']
df['grouped'] = df['grouped'].fillna(method='bfill')

print df
   tid  instr_count  fnname grouped
0   22       892806    main    main
1   22           18  randlc   makea
2   22          120   makea   makea

print df.groupby(['tid','grouped'])['instr_count'].sum().reset_index()
   tid grouped  instr_count
0   22    main       892806
1   22   makea          138

Or with agg:

print df.groupby('grouped').agg({'tid':'first', 'instr_count': sum}).reset_index()

  grouped  tid  instr_count
0    main   22       892806
1   makea   22          138
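
For readers on Python 3 and a recent pandas release, the steps above can be sketched as a self-contained script. This is only a minimal sketch, not the answer's original code: it assumes Series.where and Series.bfill in place of fillna(method='bfill'), the helper name compress_by_name is made up for illustration, and the sample rows are copied from the question.

```python
import pandas as pd


def compress_by_name(df, names):
    """Sum instr_count over each run of rows that ends at a listed fnname.

    Hypothetical helper (not part of the original answer). It assumes each
    listed name closes at most one run per tid.
    """
    out = df.copy()
    # Keep fnname only where it is in the list; every other row becomes NaN.
    out["grouped"] = out["fnname"].where(out["fnname"].isin(names))
    # Back-fill so each row carries the next listed name at or below it.
    out["grouped"] = out["grouped"].bfill()
    # sort=False keeps the groups in order of first appearance.
    return (
        out.groupby(["tid", "grouped"], sort=False)["instr_count"]
        .sum()
        .reset_index()
    )


sample = pd.DataFrame(
    {
        "tid": [22, 22, 22],
        "instr_count": [892806, 18, 120],
        "fnname": ["main", "randlc", "makea"],
    }
)
result = compress_by_name(sample, ["main", "makea"])
print(result)
```

Note that isin matches whole names exactly, whereas the loop in the question uses a substring test (f in row[3]), so the two can differ if a listed name is only part of an fnname value.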

The same approach applied to the second sample from the question:

li = ['main','makea']
df['grouped'] = df.loc[df['fnname'].isin(li), 'fnname']
df['grouped'] = df['grouped'].fillna(method='bfill')

print df
   tid  instr_count  fnname grouped
0   22       892806    main    main
1   22           18  randlc   makea
2   22            7  randlc   makea
3   22           35  randlc   makea
4   22           20  randlc   makea
5   22          120   makea   makea

print df.groupby(['tid','grouped'])['instr_count'].sum().reset_index()
   tid grouped  instr_count
0   22    main       892806
1   22   makea          200

print df.groupby('grouped').agg({'tid':'first', 'instr_count': sum}).reset_index()
  grouped  tid  instr_count
0    main   22       892806
1   makea   22          200
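
One caveat: grouping on the back-filled name puts all rows with the same tid and name into a single group. If a listed function such as makea closes several separate runs, those runs are summed together, whereas the loop in the question emits one row per run. A possible variant, sketched here assuming Python 3 and pandas 0.25 or later (for named aggregation), numbers each run instead. The helper name compress_runs and the sample rows with a repeated makea are made up for illustration.

```python
import pandas as pd


def compress_runs(df, names):
    """Emit one row per run of rows that ends at a listed fnname.

    Hypothetical helper; like the loop in the question, it ignores tid
    boundaries and takes tid from the closing row of each run.
    """
    is_end = df["fnname"].isin(names)
    # Run id = number of listed rows strictly above the current row,
    # so a listed row shares its id with the unlisted rows before it.
    run_id = (is_end.cumsum() - is_end.astype(int)).rename("run")
    out = df.groupby(run_id, sort=False).agg(
        tid=("tid", "last"),
        instr_count=("instr_count", "sum"),
        fnname=("fnname", "last"),
    )
    # Drop a trailing run that never reaches a listed name.
    out = out[out["fnname"].isin(names)]
    return out.reset_index(drop=True)


sample = pd.DataFrame(
    {
        "tid": [22, 22, 22, 22, 22],
        "instr_count": [892806, 18, 120, 7, 35],
        "fnname": ["main", "randlc", "makea", "randlc", "makea"],
    }
)
result = compress_runs(sample, ["main", "makea"])
print(result)
```

Here the two makea runs stay separate (138 and 42) instead of collapsing into one row of 180, which matches what the row-by-row loop would produce.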

Merging an undetermined number of rows in a DataFrame

1 answer: