Iterating over a very large DataFrame in python pandas is too time-consuming

Time: 2017-12-10 20:18:31

Tags: python pandas dataframe bigdata

I am trying to iterate over more than 5 million records in a CSV. I am stuck on the following loop.

trajectory = 0
for index, row in df.iterrows():
    if row['trajectory'] == 'NaN':
        trajectory = trajectory + 1
        df.loc[index, 'classification'] = trajectory
    else:
        df.loc[index, 'classification'] = trajectory

Whenever I encounter 'NaN' in my DataFrame, I increment my trajectory value and write that value into my 'classification' column.

It works on a smaller dataset, but when I run this code against the full 0.5 GB CSV it takes hours.

1 answer:

Answer 0 (score: 4)

If 'NaN' is a string, compare against it and use cumsum:

df['classification'] = (df['trajectory'] == 'NaN').cumsum() + trajectory

If NaN is a missing value, compare with isnull instead:

df['classification'] = df['trajectory'].isnull().cumsum() + trajectory
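To see why this works, here is a minimal sketch on toy data (the sample values are made up for illustration): each literal 'NaN' flips the boolean mask to True, and cumsum turns those flips into a running segment counter, which is exactly what the original loop computed.

```python
import pandas as pd

# Toy frame; 'NaN' here is the literal string, as in the question.
df = pd.DataFrame({'trajectory': ['a', 'NaN', 'b', 'b', 'NaN', 'c']})

trajectory = 0
# Each 'NaN' marker starts a new segment; cumsum numbers the segments.
df['classification'] = (df['trajectory'] == 'NaN').cumsum() + trajectory

print(df['classification'].tolist())  # → [0, 1, 1, 1, 2, 2]
```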

Timing

np.random.seed(2017)
L = ['s','a','NaN']
N = 1000
df = pd.DataFrame({
    'trajectory': np.random.choice(L, size=N)
})
#print (df)

trajectory = 0
def new(df, trajectory):
    df['classification'] = (df['trajectory'] == 'NaN').cumsum() + trajectory
    return df


def old(df, trajectory):
    for index, row in df.iterrows():
        if row['trajectory'] == 'NaN':
            trajectory = trajectory +1
            df.loc[index, 'classification']= trajectory
        else:
            df.loc[index, 'classification'] = trajectory
    return df

In [74]: %timeit (old(df, trajectory))
1 loop, best of 3: 609 ms per loop

In [75]: %timeit (new(df, trajectory))
1000 loops, best of 3: 928 µs per loop
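As a sanity check (not part of the original answer), one can verify on the same random data that the vectorized version produces exactly the values the loop does; the dtypes may differ, since the loop creates a float column while cumsum yields integers:

```python
import numpy as np
import pandas as pd

np.random.seed(2017)
df = pd.DataFrame({'trajectory': np.random.choice(['s', 'a', 'NaN'], size=1000)})

def new(df, trajectory):
    out = df.copy()
    out['classification'] = (out['trajectory'] == 'NaN').cumsum() + trajectory
    return out

def old(df, trajectory):
    out = df.copy()
    for index, row in out.iterrows():
        if row['trajectory'] == 'NaN':
            trajectory = trajectory + 1
        out.loc[index, 'classification'] = trajectory
    return out

# Compare values after casting to a common dtype.
assert new(df, 0)['classification'].astype(int).equals(
    old(df, 0)['classification'].astype(int))
```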