所以可以说我在熊猫中有一个具有m行n列的DataFrame。再说一遍,我想反转列的顺序,这可以通过以下代码完成:
df_reversed = df[df.columns[::-1]]
此操作的大复杂度是多少?我假设这将取决于列数,但还会取决于行数吗?
答案 0 :(得分:7)
我不知道熊猫是如何实现的,但是我确实进行了经验测试。我在Jupyter笔记本中运行以下代码以测试操作速度:
def get_dummy_df(n):
return pd.DataFrame({'a': [1,2]*n, 'b': [4,5]*n, 'c': [7,8]*n})
df = get_dummy_df(100)
print df.shape
%timeit df_r = df[df.columns[::-1]]
df = get_dummy_df(1000)
print df.shape
%timeit df_r = df[df.columns[::-1]]
df = get_dummy_df(10000)
print df.shape
%timeit df_r = df[df.columns[::-1]]
df = get_dummy_df(100000)
print df.shape
%timeit df_r = df[df.columns[::-1]]
df = get_dummy_df(1000000)
print df.shape
%timeit df_r = df[df.columns[::-1]]
df = get_dummy_df(10000000)
print df.shape
%timeit df_r = df[df.columns[::-1]]
输出为:
(200, 3)
1000 loops, best of 3: 419 µs per loop
(2000, 3)
1000 loops, best of 3: 425 µs per loop
(20000, 3)
1000 loops, best of 3: 498 µs per loop
(200000, 3)
100 loops, best of 3: 2.66 ms per loop
(2000000, 3)
10 loops, best of 3: 25.2 ms per loop
(20000000, 3)
1 loop, best of 3: 207 ms per loop
如您所见,在前三种情况下,操作的开销是大部分时间(400-500µs),但是从第四种情况开始,所需的时间开始与操作量成正比。数据,每次都增加一个数量级。
所以,假设 n 也必须有一个比例,看来我们正在处理O(m * n)
答案 1 :(得分:3)
Big O的复杂度(从Pandas 0.24开始)为m*n
,其中m
是列数,n
是行数。请注意,这是将DataFrame.__getitem__
方法(又名[]
)与Index
(see relevant code, with other types that would trigger a copy)结合使用。
以下是有用的堆栈跟踪:
<ipython-input-4-3162cae03863>(2)<module>()
1 columns = df.columns[::-1]
----> 2 df_reversed = df[columns]
pandas/core/frame.py(2682)__getitem__()
2681 # either boolean or fancy integer index
-> 2682 return self._getitem_array(key)
2683 elif isinstance(key, DataFrame):
pandas/core/frame.py(2727)_getitem_array()
2726 indexer = self.loc._convert_to_indexer(key, axis=1)
-> 2727 return self._take(indexer, axis=1)
2728
pandas/core/generic.py(2789)_take()
2788 axis=self._get_block_manager_axis(axis),
-> 2789 verify=True)
2790 result = self._constructor(new_data).__finalize__(self)
pandas/core/internals.py(4539)take()
4538 return self.reindex_indexer(new_axis=new_labels, indexer=indexer,
-> 4539 axis=axis, allow_dups=True)
4540
pandas/core/internals.py(4421)reindex_indexer()
4420 new_blocks = self._slice_take_blocks_ax0(indexer,
-> 4421 fill_tuple=(fill_value,))
4422 else:
pandas/core/internals.py(1254)take_nd()
1253 new_values = algos.take_nd(values, indexer, axis=axis,
-> 1254 allow_fill=False)
1255 else:
> pandas/core/algorithms.py(1658)take_nd()
1657 import ipdb; ipdb.set_trace()
-> 1658 func = _get_take_nd_function(arr.ndim, arr.dtype, out.dtype, axis=axis,
1659 mask_info=mask_info)
1660 func(arr, indexer, out, fill_value)
在func
中对L1660的pandas/core/algorithms
调用最终会调用复杂性为O(m * n)
的cython函数。这是原始数据中的数据被复制到out
中的位置。 out
按相反顺序包含原始数据的副本。
inner_take_2d_axis0_template = """\
cdef:
Py_ssize_t i, j, k, n, idx
%(c_type_out)s fv
n = len(indexer)
k = values.shape[1]
fv = fill_value
IF %(can_copy)s:
cdef:
%(c_type_out)s *v
%(c_type_out)s *o
#GH3130
if (values.strides[1] == out.strides[1] and
values.strides[1] == sizeof(%(c_type_out)s) and
sizeof(%(c_type_out)s) * n >= 256):
for i from 0 <= i < n:
idx = indexer[i]
if idx == -1:
for j from 0 <= j < k:
out[i, j] = fv
else:
v = &values[idx, 0]
o = &out[i, 0]
memmove(o, v, <size_t>(sizeof(%(c_type_out)s) * k))
return
for i from 0 <= i < n:
idx = indexer[i]
if idx == -1:
for j from 0 <= j < k:
out[i, j] = fv
else:
for j from 0 <= j < k:
out[i, j] = %(preval)svalues[idx, j]%(postval)s
"""
请注意,在上面的模板函数中,有一个使用memmove
的路径(在这种情况下采用的路径,因为我们是从int64
映射到int64
和维度的输出相同,因为我们只是在切换索引)。请注意,memmove
is still O(n)与它必须复制的字节数成正比,尽管可能比直接写入索引要快。
答案 2 :(得分:-2)
我使用big_O
拟合库here进行了实证测试
注意:所有测试均在自变量扫描6个数量级上进行(即
rows
从10
到10^6
相对于column
的常量3
大小,columns
从10
到10^6
相对于row
的常量10
大小
结果表明,columns
中.columns[::-1]
反向操作DataFrame
的复杂度是
O(n^3)
,其中n是rows
的数量O(n^3)
,其中n是columns
的数量先决条件:您将需要使用终端命令
安装big_o()
pip install big_o
代码
import big_o
import pandas as pd
import numpy as np
SWEAP_LOG10 = 6
COLUMNS = 3
ROWS = 10
def build_df(rows, columns):
# To isolated the creation of the DataFrame from the inversion operation.
narray = np.zeros(rows*columns).reshape(rows, columns)
df = pd.DataFrame(narray)
return df
def flip_columns(df):
return df[df.columns[::-1]]
def get_row_df(n, m=COLUMNS):
return build_df(1*10**n, m)
def get_column_df(n, m=ROWS):
return build_df(m, 1*10**n)
# infer the big_o on columns[::-1] operation vs. rows
best, others = big_o.big_o(flip_columns, get_row_df, min_n=1, max_n=SWEAP_LOG10,n_measures=SWEAP_LOG10, n_repeats=10)
# print results
print('Measuring .columns[::-1] complexity against rapid increase in # rows')
print('-'*80 + '\nBig O() fits: {}\n'.format(best) + '-'*80)
for class_, residual in others.items():
print('{:<60s} (res: {:.2G})'.format(str(class_), residual))
print('-'*80)
# infer the big_o on columns[::-1] operation vs. columns
best, others = big_o.big_o(flip_columns, get_column_df, min_n=1, max_n=SWEAP_LOG10,n_measures=SWEAP_LOG10, n_repeats=10)
# print results
print()
print('Measuring .columns[::-1] complexity against rapid increase in # columns')
print('-'*80 + '\nBig O() fits: {}\n'.format(best) + '-'*80)
for class_, residual in others.items():
print('{:<60s} (res: {:.2G})'.format(str(class_), residual))
print('-'*80)
结果
Measuring .columns[::-1] complexity against rapid increase in # rows
--------------------------------------------------------------------------------
Big O() fits: Cubic: time = -0.017 + 0.00067*n^3
--------------------------------------------------------------------------------
Constant: time = 0.032 (res: 0.021)
Linear: time = -0.051 + 0.024*n (res: 0.011)
Quadratic: time = -0.026 + 0.0038*n^2 (res: 0.0077)
Cubic: time = -0.017 + 0.00067*n^3 (res: 0.0052)
Polynomial: time = -6.3 * x^1.5 (res: 6)
Logarithmic: time = -0.026 + 0.053*log(n) (res: 0.015)
Linearithmic: time = -0.024 + 0.012*n*log(n) (res: 0.0094)
Exponential: time = -7 * 0.66^n (res: 3.6)
--------------------------------------------------------------------------------
Measuring .columns[::-1] complexity against rapid increase in # columns
--------------------------------------------------------------------------------
Big O() fits: Cubic: time = -0.28 + 0.009*n^3
--------------------------------------------------------------------------------
Constant: time = 0.38 (res: 3.9)
Linear: time = -0.73 + 0.32*n (res: 2.1)
Quadratic: time = -0.4 + 0.052*n^2 (res: 1.5)
Cubic: time = -0.28 + 0.009*n^3 (res: 1.1)
Polynomial: time = -6 * x^2.2 (res: 16)
Logarithmic: time = -0.39 + 0.71*log(n) (res: 2.8)
Linearithmic: time = -0.38 + 0.16*n*log(n) (res: 1.8)
Exponential: time = -7 * 1^n (res: 9.7)
--------------------------------------------------------------------------------