I have a very large dataframe with 100 years of dates as column headers (i.e. ~36500 columns) and 100 years of dates as the index (i.e. ~36500 rows). I have a function that computes a value for each element of the dataframe, and it would need to run 36500^2 times.
OK, the problem is not the function, which is quite fast, but assigning the values to the dataframe. Even if I assign a constant this way, it takes about 1 second per 6 assignments. Clearly I'm doing something wrong:
for i, row in df_mBase.iterrows():
    for idx, val in enumerate(row):
        df_mBase.ix[i][idx] = 1
    print(i)
Normally in C/Java I would just run the 36500x36500 double loop and access preallocated memory directly by index, which happens in constant time with virtually no overhead. But that doesn't seem to be an option in Python?
What is the fastest way to store this data in a dataframe? Pythonic or not, I'm only after speed - I don't care about elegance.
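(For reference, the C-style preallocated-buffer loop described above does exist in Python via NumPy; `n` here is a small stand-in for 36500:)

```python
import numpy as np

n = 300  # small stand-in for the 36500 x 36500 case
buf = np.zeros((n, n), dtype=np.float64)  # preallocated, contiguous memory

for i in range(n):
    for j in range(n):
        buf[i, j] = 1.0  # direct indexed write, no pandas label lookup
```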
Answer 0 (score: 3)
You should create the data structure in native Python or numpy and pass the finished data to the DataFrame constructor. And if your function can be written using numpy functions and operations, you can use the vectorized nature of numpy to avoid looping over all the indices.
Here is an example with a made-up function:
import numpy as np
import pandas as pd
import datetime as dt
import dateutil as du

dates = [dt.date(2017, 1, 1) - du.relativedelta.relativedelta(days=i) for i in range(36500)]
data = np.zeros((36500, 36500), dtype=np.uint8)

def my_func(i, j):
    return (sum(divmod(i, j)) - sum(divmod(j, i))) % 255

for i in range(1, 36500):
    for j in range(1, 36500):
        data[i, j] = my_func(i, j)

df = pd.DataFrame(data, columns=dates, index=dates)
df.head(5)
# returns:
2017-08-21 2017-08-20 2017-08-19 2017-08-18 2017-08-17 \
2017-08-21 0 0 0 0 0
2017-08-20 0 0 254 253 252
2017-08-19 0 1 0 0 0
2017-08-18 0 2 0 0 1
2017-08-17 0 3 0 254 0
... 1917-09-19 1917-09-18 1917-09-17 1917-09-16
2017-08-21 ... 0 0 0 0
2017-08-20 ... 225 224 223 222
2017-08-19 ... 114 113 113 112
2017-08-18 ... 77 76 77 76
2017-08-17 ... 60 59 58 57
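For this particular my_func, even the filling loop can be removed: since sum(divmod(i, j)) is just i // j + i % j, the whole table can be computed with NumPy broadcasting (a sketch that is not part of the original answer, shown on a small n):

```python
import numpy as np

n = 60  # small stand-in for 36500 so this runs quickly
i = np.arange(n).reshape(-1, 1)  # column vector, broadcasts down the rows
j = np.arange(n).reshape(1, -1)  # row vector, broadcasts across the columns

# replace the zeros by 1 to avoid division by zero; the loop version
# skipped row 0 and column 0 anyway, so we zero them out afterwards
ii = np.where(i == 0, 1, i)
jj = np.where(j == 0, 1, j)

data_vec = ((ii // jj + ii % jj) - (jj // ii + jj % ii)) % 255
data_vec[0, :] = 0
data_vec[:, 0] = 0
```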
Answer 1 (score: 3)
Why this might be slow
There are several reasons:
- .ix is a magical type of indexer that can perform both label-based and position-based indexing at once, and it is deprecated in favor of the stricter .loc (label-based) and .iloc (position-based). I assume .ix does a lot of magic behind the scenes to figure out whether label-based or position-based indexing is needed.
- .iterrows returns a (new?) Series for every row. Column-based iteration might be faster, since .iteritems iterates over the columns.
- df_mBase.ix[i][idx] first returns the row as a Series, then takes element idx from it, and assigns the value 1 to that.
df_mBase.loc[i, idx] = 1
should improve on this.
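A minimal sketch of the difference on a small hypothetical frame (`.ix` itself is removed in modern pandas, so only the recommended accessors are shown):

```python
import pandas as pd

df = pd.DataFrame(0, index=['a', 'b'], columns=['x', 'y'])

# one-step scalar assignment: pandas resolves row and column together
df.loc['a', 'y'] = 1   # label-based
df.iloc[1, 0] = 2      # position-based

# the chained form df.loc['a']['y'] = 1 first materializes the row as a
# Series and then writes into that intermediate object -- slower, and in
# modern pandas the write may not reach df at all
```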
import pandas as pd
import itertools
import timeit

def generate_dummy_data(years=1):
    period = pd.Timedelta(365 * years, unit='D')
    start = pd.Timestamp('19000101')
    offset = pd.Timedelta(10, unit='h')
    dates1 = pd.DatetimeIndex(start=start, end=start + period, freq='d')
    dates2 = pd.DatetimeIndex(start=start + offset, end=start + offset + period, freq='d')
    return pd.DataFrame(index=dates1, columns=dates2, dtype=float)
def assign_original(df_orig):
    df_new = df_orig.copy(deep=True)
    for i, row in df_new.iterrows():
        for idx, val in enumerate(row):
            df_new.ix[i][idx] = 1
    return df_new

def assign_other(df_orig):
    df_new = df_orig.copy(deep=True)
    for (i, idx_i), (j, idx_j) in itertools.product(enumerate(df_new.index), enumerate(df_new.columns)):
        df_new[idx_j][idx_i] = 1
    return df_new

def assign_loc(df_orig):
    df_new = df_orig.copy(deep=True)
    for i, row in df_new.iterrows():
        for idx, val in enumerate(row):
            df_new.loc[i][idx] = 1
    return df_new

def assign_loc_product(df_orig):
    df_new = df_orig.copy(deep=True)
    for i, j in itertools.product(df_new.index, df_new.columns):
        df_new.loc[i, j] = 1
    return df_new

def assign_iloc_product(df_orig):
    df_new = df_orig.copy(deep=True)
    for (i, idx_i), (j, idx_j) in itertools.product(enumerate(df_new.index), enumerate(df_new.columns)):
        df_new.iloc[i, j] = 1
    return df_new

def assign_iloc_product_range(df_orig):
    df_new = df_orig.copy(deep=True)
    for i, j in itertools.product(range(len(df_new.index)), range(len(df_new.columns))):
        df_new.iloc[i, j] = 1
    return df_new

def assign_index(df_orig):
    df_new = df_orig.copy(deep=True)
    for (i, idx_i), (j, idx_j) in itertools.product(enumerate(df_new.index), enumerate(df_new.columns)):
        df_new[idx_j][idx_i] = 1
    return df_new

def assign_column(df_orig):
    df_new = df_orig.copy(deep=True)
    for c, column in df_new.iteritems():
        for idx, val in enumerate(column):
            df_new[c][idx] = 1
    return df_new

def assign_column2(df_orig):
    df_new = df_orig.copy(deep=True)
    for c, column in df_new.iteritems():
        for idx, val in enumerate(column):
            column[idx] = 1
    return df_new

def assign_itertuples(df_orig):
    df_new = df_orig.copy(deep=True)
    for i, row in enumerate(df_new.itertuples()):
        for idx, val in enumerate(row[1:]):
            df_new.iloc[i, idx] = 1
    return df_new

def assign_applymap(df_orig):
    df_new = df_orig.copy(deep=True)
    df_new = df_new.applymap(lambda x: 1)
    return df_new

def assign_vectorized(df_orig):
    df_new = df_orig.copy(deep=True)
    for i in df_new:
        df_new[i] = 1
    return df_new
methods = [
    ('assign_original', assign_original),
    ('assign_loc', assign_loc),
    ('assign_loc_product', assign_loc_product),
    ('assign_iloc_product', assign_iloc_product),
    ('assign_iloc_product_range', assign_iloc_product_range),
    ('assign_index', assign_index),
    ('assign_column', assign_column),
    ('assign_column2', assign_column2),
    ('assign_itertuples', assign_itertuples),
    ('assign_vectorized', assign_vectorized),
    ('assign_applymap', assign_applymap),
]
def get_timings(period=1, methods=()):
    print('=' * 10)
    print(f'generating timings for a period of {period} years')
    df_orig = generate_dummy_data(period)
    df_orig.info(verbose=False)
    repeats = 1
    for method_name, method in methods:
        result = pd.DataFrame()

        def my_method():
            """
            This looks a bit icky, but is the best way I found to make sure the values
            are really changed, and not just on a copy of a DataFrame
            """
            nonlocal result
            result = method(df_orig)

        t = timeit.Timer(my_method).timeit(number=repeats)
        assert result.iloc[3, 3] == 1
        print(f'{method_name} took {t / repeats} seconds')
        yield (method_name, {'time': t, 'memory': result.memory_usage(deep=True).sum() / 1024})
periods = [0.03, 0.1, 0.3, 1, 3]
results = {period: dict(get_timings(period, methods)) for period in periods}
print(results)

timings_dict = {period: {k: v['time'] for k, v in result.items()} for period, result in results.items()}
df = pd.DataFrame.from_dict(timings_dict)
df.transpose().plot(logy=True).figure.savefig('test.png')
                               0.03       0.1       0.3        1.0         3.0
assign_applymap            0.001989  0.009862  0.018018   0.105569    0.549511
assign_vectorized          0.002974  0.008428  0.035994   0.162565    3.810138
assign_index               0.013717  0.137134  1.288852  14.190128  111.102662
assign_column2             0.026260  0.186588  1.664345  19.204453  143.103077
assign_column              0.016811  0.212158  1.838733  21.053627  153.827845
assign_itertuples          0.025130  0.249886  2.125968  24.639593  185.975111
assign_iloc_product_range  0.026982  0.247069  2.199019  23.902244  186.548500
assign_iloc_product        0.021225  0.233454  2.437183  25.143673  218.849143
assign_loc_product         0.018743  0.290104  2.515379  32.778794  258.244436
assign_loc                 0.029050  0.349551  2.822797  32.087433  294.052933
assign_original            0.034315  0.337207  2.714154  30.361072  332.327008
If you can use vectorization, do so. Depending on the computation, other methods may apply: if you only need the values, applymap seems fastest; if you also need the index and/or the columns, iterate over the columns. If you cannot vectorize, df[column][index] = x works fastest, with iterating over the columns via df.iteritems() a close second.
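To round this off, a minimal sketch of the vectorized route on a small hypothetical frame (the column loop mirrors assign_vectorized above; df[:] = 2 is a whole-frame variant that was not among the benchmarked methods):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((4, 3)), columns=['a', 'b', 'c'])

# column-wise vectorized assignment: one fast write per column,
# instead of one slow write per cell
for col in df:
    df[col] = 1

# or write every cell of the existing frame in a single statement
df[:] = 2
```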