Fastest way to assign values to a pandas DataFrame

Asked: 2017-08-21 10:52:38

Tags: python pandas dataframe

I have a very large DataFrame with 100 years of dates as column headers (i.e. ~36,500 columns) and 100 years of dates as the index (i.e. ~36,500 rows). I have a function that calculates a value for each element of the DataFrame, so it needs to run 36,500^2 times.

Now, the problem is not the function, which is very fast, but assigning the values to the DataFrame. Even if I just assign a constant, it takes about 1 second per 6 assignments this way. Obviously I am being pretty thick here, as you can tell:

for i, row in df_mBase.iterrows():
    for idx, val in enumerate(row):
        df_mBase.ix[i][idx] = 1
    print(i)

Normally in C/Java I would simply run a 36500x36500 double loop and access the pre-allocated memory directly by index, which works in constant time with virtually no overhead. But that does not seem to be an option in Python?

What is the fastest way to store this data in a DataFrame? Pythonic or not, I am only after speed here; I don't care about elegance.

2 Answers:

Answer 0 (score: 3)

You should create the data structure in native Python or numpy and pass the data to the DataFrame constructor. If your function can be written using numpy functions/operations, you can exploit numpy's vectorization and avoid looping over every index.

Here is an example with a made-up function:

import numpy as np
import pandas as pd
import datetime as dt
from dateutil.relativedelta import relativedelta

# note: the head() output below was generated on 2017-08-21
dates = [dt.date.today() - relativedelta(days=i) for i in range(36500)]
data = np.zeros((36500,36500), dtype=np.uint8)

def my_func(i, j):
    return (sum(divmod(i,j)) - sum(divmod(j,i))) % 255

# still a plain Python double loop, but the writes go into the
# pre-allocated numpy array instead of into the DataFrame
for i in range(1, 36500):
    for j in range(1, 36500):
        data[i, j] = my_func(i, j)

df = pd.DataFrame(data, columns=dates, index=dates)

df.head(5)
#returns:

            2017-08-21  2017-08-20  2017-08-19  2017-08-18  2017-08-17  \
2017-08-21           0           0           0           0           0
2017-08-20           0           0         254         253         252
2017-08-19           0           1           0           0           0
2017-08-18           0           2           0           0           1
2017-08-17           0           3           0         254           0

               ...      1917-09-19  1917-09-18  1917-09-17  1917-09-16
2017-08-21     ...               0           0           0           0
2017-08-20     ...             225         224         223         222
2017-08-19     ...             114         113         113         112
2017-08-18     ...              77          76          77          76
2017-08-17     ...              60          59          58          57
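
Not part of the original answer, but worth noting: this particular made-up my_func can even be expressed without any Python-level loop, using broadcasting and np.divmod. A minimal sketch (mine, with n = 365 instead of 36500 purely to keep the int64 intermediates small):

import numpy as np

n = 365  # illustration size; see the note below about the full 36500 case

i = np.arange(1, n, dtype=np.int64)[:, None]  # row indices as a column vector
j = np.arange(1, n, dtype=np.int64)[None, :]  # column indices as a row vector

# np.divmod broadcasts element-wise over the whole (n-1) x (n-1) grid,
# replacing the nested Python loops in one shot
qi, ri = np.divmod(i, j)
qj, rj = np.divmod(j, i)

data = np.zeros((n, n), dtype=np.uint8)
data[1:, 1:] = (qi + ri - qj - rj) % 255  # same formula as my_func

For the full 36,500 x 36,500 grid the four int64 intermediates alone would take roughly 40 GB, so you would compute it in row blocks, but the per-element Python call disappears either way.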

Answer 1 (score: 3)

Why this might be slow

There are several reasons:

.ix

.ix is a magic-type indexer that can do both label-based and positional indexing, but it is deprecated in favour of the stricter .loc for label-based and .iloc for positional indexing. I assume .ix does a lot of magic behind the scenes to figure out whether label-based or positional indexing is needed.
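
For reference (my example, not part of the original answer), the two stricter indexers look like this:

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], index=['a', 'b'], columns=['x', 'y'])

df.loc['a', 'y']   # label-based: row label 'a', column label 'y' -> 2
df.iloc[0, 1]      # position-based: first row, second column -> 2
# .ix accepted a mix of both and had to guess which was meant on every
# call, which is part of why it was slower and eventually removed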

.iterrows

returns a (new?) Series for every row. Column-based iteration might be faster, since .iteritems iterates over the columns.
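
A quick way to see that each row really is a fresh Series rather than a view into the frame (my sketch, not from the original answer):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

for i, row in df.iterrows():
    row['a'] = 99          # writes into a newly built Series, not into df

print(df['a'].tolist())    # [1, 2] -- the DataFrame is unchanged

This is also why the per-element writes in the question have to go through an indexer such as .ix or .loc instead of the row object itself.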

[] []

df_mBase.ix[i][idx] first returns a Series, then takes element idx out of it, and assigns the value 1 to that element.

df_mBase.loc[i, idx] = 1

should improve on this.

Benchmark

import pandas as pd

import itertools
import timeit


def generate_dummy_data(years=1):
    period = pd.Timedelta(365 * years, unit='D')

    start = pd.Timestamp('19000101')
    offset = pd.Timedelta(10, unit='h')

    dates1 = pd.date_range(start=start, end=start + period, freq='D')
    dates2 = pd.date_range(start=start + offset, end=start + offset + period, freq='D')

    return pd.DataFrame(index=dates1, columns=dates2, dtype=float)


def assign_original(df_orig):
    # the OP's approach; note that .ix has been removed in pandas >= 1.0
    df_new = df_orig.copy(deep=True)
    for i, row in df_new.iterrows():
        for idx, val in enumerate(row):
            df_new.ix[i][idx] = 1
    return df_new


def assign_other(df_orig):
    df_new = df_orig.copy(deep=True)
    for (i, idx_i), (j, idx_j) in itertools.product(enumerate(df_new.index), enumerate(df_new.columns)):
        df_new[idx_j][idx_i] = 1
    return df_new


def assign_loc(df_orig):
    df_new = df_orig.copy(deep=True)
    for i, row in df_new.iterrows():
        for idx, val in enumerate(row):
            df_new.loc[i][idx] = 1
    return df_new


def assign_loc_product(df_orig):
    df_new = df_orig.copy(deep=True)
    for i, j in itertools.product(df_new.index, df_new.columns):
        df_new.loc[i, j] = 1
    return df_new


def assign_iloc_product(df_orig):
    df_new = df_orig.copy(deep=True)
    for (i, idx_i), (j, idx_j) in itertools.product(enumerate(df_new.index), enumerate(df_new.columns)):
        df_new.iloc[i, j] = 1
    return df_new


def assign_iloc_product_range(df_orig):
    df_new = df_orig.copy(deep=True)
    for i, j in itertools.product(range(len(df_new.index)), range(len(df_new.columns))):
        df_new.iloc[i, j] = 1
    return df_new


def assign_index(df_orig):
    df_new = df_orig.copy(deep=True)
    for (i, idx_i), (j, idx_j) in itertools.product(enumerate(df_new.index), enumerate(df_new.columns)):
        df_new[idx_j][idx_i] = 1
    return df_new


def assign_column(df_orig):
    df_new = df_orig.copy(deep=True)
    for c, column in df_new.iteritems():
        for idx, val in enumerate(column):
            df_new[c][idx] = 1
    return df_new


def assign_column2(df_orig):
    df_new = df_orig.copy(deep=True)
    for c, column in df_new.iteritems():
        for idx, val in enumerate(column):
            column[idx] = 1
    return df_new


def assign_itertuples(df_orig):
    df_new = df_orig.copy(deep=True)
    for i, row in enumerate(df_new.itertuples()):
        for idx, val in enumerate(row[1:]):
            df_new.iloc[i, idx] = 1
    return df_new


def assign_applymap(df_orig):
    df_new = df_orig.copy(deep=True)
    df_new = df_new.applymap(lambda x: 1)
    return df_new


def assign_vectorized(df_orig):
    df_new = df_orig.copy(deep=True)
    for i in df_new:
        df_new[i] = 1
    return df_new


methods = [
    ('assign_original', assign_original),
    ('assign_loc', assign_loc),
    ('assign_loc_product', assign_loc_product),
    ('assign_iloc_product', assign_iloc_product),
    ('assign_iloc_product_range', assign_iloc_product_range),
    ('assign_index', assign_index),
    ('assign_column', assign_column),
    ('assign_column2', assign_column2),
    ('assign_itertuples', assign_itertuples),
    ('assign_vectorized', assign_vectorized),
    ('assign_applymap', assign_applymap),
]


def get_timings(period=1, methods=()):
    print('=' * 10)
    print(f'generating timings for a period of {period} years')
    df_orig = generate_dummy_data(period)
    df_orig.info(verbose=False)
    repeats = 1
    for method_name, method in methods:
        result = pd.DataFrame()

        def my_method():
            """
            This looks a bit icky, but is the best way I found to make sure the values are really changed,
            and not just on a copy of a DataFrame
            """
            nonlocal result
            result = method(df_orig)

        t = timeit.Timer(my_method).timeit(number=repeats)

        assert result.iloc[3, 3] == 1

        print(f'{method_name} took {t / repeats} seconds')
        yield (method_name, {'time': t, 'memory': result.memory_usage(deep=True).sum()/1024})


periods = [0.03, 0.1, 0.3, 1, 3]


results = {period: dict(get_timings(period, methods)) for period in periods}

print(results)

timings_dict = {period: {k: v['time'] for k, v in result.items()} for period, result in results.items()}

df = pd.DataFrame.from_dict(timings_dict)
df.transpose().plot(logy=True).figure.savefig('test.png')

The resulting timings, in seconds per run, for each period length in years:

                              0.03        0.1         0.3         1.0         3.0
assign_applymap               0.001989    0.009862    0.018018    0.105569    0.549511
assign_vectorized             0.002974    0.008428    0.035994    0.162565    3.810138
assign_index                  0.013717    0.137134    1.288852    14.190128   111.102662
assign_column2                0.026260    0.186588    1.664345    19.204453   143.103077
assign_column                 0.016811    0.212158    1.838733    21.053627   153.827845
assign_itertuples             0.025130    0.249886    2.125968    24.639593   185.975111
assign_iloc_product_range     0.026982    0.247069    2.199019    23.902244   186.548500
assign_iloc_product           0.021225    0.233454    2.437183    25.143673   218.849143
assign_loc_product            0.018743    0.290104    2.515379    32.778794   258.244436
assign_loc                    0.029050    0.349551    2.822797    32.087433   294.052933
assign_original               0.034315    0.337207    2.714154    30.361072   332.327008

Conclusion


If you can vectorize, do so. Depending on the calculation you may be able to use one of the other methods. If you only need the value being assigned, applymap seems fastest. If you also need the index and/or the column, work with the columns.

If you cannot vectorize, df[column][index] = x works fastest, with iterating over the columns via df.iteritems() a close second.
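
To tie this back to the question: a minimal sketch (my own, with a hypothetical element-wise function f standing in for the OP's) of the column-wise approach, computing one whole column per assignment instead of one element at a time:

import pandas as pd

# small hypothetical stand-in for the OP's 100-year grid
dates = pd.date_range('2017-01-01', periods=365, freq='D')
df_mBase = pd.DataFrame(index=dates, columns=dates, dtype=float)

def f(row_labels, col_label):
    # placeholder for the OP's function, written to take the whole
    # index at once and return one value per row
    return (row_labels - col_label).days.astype(float)

for col in df_mBase.columns:
    df_mBase[col] = f(df_mBase.index, col)   # one assignment per column

This keeps the row and column labels available to the function (which applymap does not give you) while still doing only one DataFrame write per column.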