I have a pandas DataFrame with fewer than 30K rows and 7 columns, and I'm trying to correlate 4 of the columns against a 5th. The problem is that I want to do this with a much larger dataset, but this already takes about 40 seconds to run. Here is my code:
df_a = dfr[['ID', 'State', 'perform', 'A']].groupby(['ID', 'State']).corr().ix[1::2][['A']].reset_index(2).drop('level_2', axis=1)
df_b = dfr[['ID', 'State', 'perform', 'B']].groupby(['ID', 'State']).corr().ix[1::2][['B']].reset_index(2).drop('level_2', axis=1)
df_c = dfr[['ID', 'State', 'perform', 'C']].groupby(['ID', 'State']).corr().ix[1::2][['C']].reset_index(2).drop('level_2', axis=1)
df_d = dfr[['ID', 'State', 'perform', 'D']].groupby(['ID', 'State']).corr().ix[1::2][['D']].reset_index(2).drop('level_2', axis=1)
df = df_a.merge(df_b, left_index=True, right_index=True)
df = df.merge(df_c, left_index=True, right_index=True)
df = df.merge(df_d, left_index=True, right_index=True)
Sample data looks like this:
ID State perform A B C D
234 AK 75.8456 1 0 0 0
284 MN 78.6752 0 0 1 0
Does anyone know how to make this faster, or a better way to implement this?
Thanks!
Answer 0 (score: 1)
Here is a quick timing comparison between the original code and mine:
%%timeit
# data
'''
ID State perform A B C D
234 AK 75.8456 1 0 0 0
284 MN 78.6752 0 0 1 0
'''
# make dataframe
dfr = pd.read_clipboard()
df_a = dfr[['ID', 'State', 'perform', 'A']].groupby(['ID', 'State']).corr().ix[1::2][['A']].reset_index(2).drop('level_2', axis=1)
df_b = dfr[['ID', 'State', 'perform', 'B']].groupby(['ID', 'State']).corr().ix[1::2][['B']].reset_index(2).drop('level_2', axis=1)
df_c = dfr[['ID', 'State', 'perform', 'C']].groupby(['ID', 'State']).corr().ix[1::2][['C']].reset_index(2).drop('level_2', axis=1)
df_d = dfr[['ID', 'State', 'perform', 'D']].groupby(['ID', 'State']).corr().ix[1::2][['D']].reset_index(2).drop('level_2', axis=1)
df = df_a.merge(df_b, left_index=True, right_index=True)
df = df.merge(df_c, left_index=True, right_index=True)
df = df.merge(df_d, left_index=True, right_index=True)
%%timeit
# data
'''
ID State perform A B C D
234 AK 75.8456 1 0 0 0
284 MN 78.6752 0 0 1 0
'''
# make dataframe
df = pd.read_clipboard()
# make other dfs
df_a = df.loc[:, :'A'].groupby(['ID', 'State']).corr().iloc[1::2][['A']].reset_index(2, drop=True)
df_b = df.loc[:, ['ID', 'State', 'perform', 'B']].groupby(['ID', 'State']).corr().iloc[1::2][['B']].reset_index(2, drop=True)
df_c = df.loc[:, ['ID', 'State', 'perform', 'C']].groupby(['ID', 'State']).corr().iloc[1::2][['C']].reset_index(2, drop=True)
df_d = df.loc[:, ['ID', 'State', 'perform', 'D']].groupby(['ID', 'State']).corr().iloc[1::2][['D']].reset_index(2, drop=True)
# concat them together
pd.concat([df_a, df_b, df_c, df_d], axis=1)
But the difference may still be negligible.
I decided to try removing the repeated code with a for loop. That improved the run time somewhat:
%%timeit
# data
'''
ID State perform A B C D
234 AK 75.8456 1 0 0 0
284 MN 78.6752 0 0 1 0
'''
# make dataframe
df = pd.read_clipboard()
# make list of letter columns
letters = ['A', 'B', 'C', 'D']
# store corr() dfs in list for concatenation
list_of_dfs = []
for letter in letters:
    list_of_dfs.append(df.loc[:, ['ID', 'State', 'perform', letter]]
                       .groupby(['ID', 'State'])
                       .corr().iloc[1::2][[letter]]
                       .reset_index(2, drop=True))
# concat them together
pd.concat(list_of_dfs, axis=1)
Answer 1 (score: 1)
The reason pandas corr is so slow is that it accounts for NaNs: under the hood it is essentially a Cython for loop.
If your data has no NaNs, numpy.corrcoef is much faster.
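To illustrate (a minimal sketch with made-up data; corrcoef assumes the input is clean, so drop NaNs first if you have any):
import numpy as np
import pandas as pd

# made-up, NaN-free frame: 1000 rows x 5 variables
df = pd.DataFrame(np.random.random(size=(1000, 5)), columns=list('ABCDE'))

# pandas: NaN-aware pairwise Pearson correlations
pd_corr = df.corr()

# numpy: the same matrix, much faster on clean data
# (rowvar=False treats each column as a variable)
np_corr = np.corrcoef(df.values, rowvar=False)

assert np.allclose(pd_corr.values, np_corr)

# with NaNs present, drop them before calling corrcoef
np_corr_clean = np.corrcoef(df.dropna().values, rowvar=False)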
Answer 2 (score: 1)
I want to second the answer above here...
numpy is dramatically faster than pandas for this. I ran a few tests to see just how much faster, and the speedups are substantial:
import timeit
import numpy as np
import pandas as pd
holdings_array = np.arange(40, 100, 20)
dates_array = [200, 800]
calc = pd.DataFrame(
    index=pd.MultiIndex.from_product(
        [['Correlation', 'Returns'], ['pandas', 'numpy', 'pandas.values'], dates_array],
        names=['Method', 'Library', 'Shape']),
    columns=holdings_array
).unstack(level=-1)
print("Checking pandas vs numpy... (ms)")
# PANDAS
for num_holdings in holdings_array:
    for num_dates in dates_array:
        df = pd.DataFrame(np.random.random(size=[num_dates, num_holdings]))
        ls = np.random.random(size=num_holdings)
        calc.loc[('Correlation', 'pandas'), (num_holdings, num_dates)] \
            = str(np.round(timeit.timeit('df.corr()', number=100, globals=globals()) * 10, 3))
        calc.loc[('Returns', 'pandas'), (num_holdings, num_dates)] \
            = str(np.round(timeit.timeit('(df*ls).sum(axis=1)', number=100, globals=globals()) * 10, 3))
# # NUMPY
# for x, num_holdings in enumerate(holdings_array):
#     for y, num_dates in enumerate(dates_array):
#         df = np.array(np.random.random(size=[num_dates, num_holdings]))
#         ls = np.random.random(size=num_holdings)
#         calc.loc[('Correlation', 'numpy'), (num_holdings, num_dates)] \
#             = str(np.round(timeit.timeit('np.corrcoef(df)', number=100, globals=globals()) * 10, 3))
#         calc.loc[('Returns', 'numpy'), (num_holdings, num_dates)] \
#             = str(np.round(timeit.timeit('(df*ls).sum(axis=1)', number=100, globals=globals()) * 10, 3))
# PANDAS.VALUES WITH NP FUNCTIONS
for num_holdings in holdings_array:
    for num_dates in dates_array:
        df = pd.DataFrame(np.random.random(size=[num_dates, num_holdings]))
        ls = np.random.random(size=num_holdings)
        calc.loc[('Correlation', 'pandas.values'), (num_holdings, num_dates)] \
            = str(np.round(timeit.timeit('np.corrcoef(df.values, rowvar=False)', number=100, globals=globals()) * 10, 3))
        calc.loc[('Returns', 'pandas.values'), (num_holdings, num_dates)] \
            = str(np.round(timeit.timeit('(df.values*ls).sum(axis=1)', number=100, globals=globals()) * 10, 3))
print(f"Results: \n{calc.to_string()}")
If possible, when doing vectorized calculations on a pandas df, switch to df.values and run the np operation instead.
For example, I can change df.corr() to np.corrcoef(df.values, rowvar=False) (note: rowvar=False matters so the output shape comes out right), and on large operations you will see 10x to 100x speedups. Not trivial.
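For instance, the swap for the weighted-sum ("Returns") case from the benchmark above looks like this (a small sketch with random placeholder data):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(800, 100)))  # dates x holdings
ls = np.random.random(size=100)                        # one weight per holding

ret_pd = (df * ls).sum(axis=1)          # pandas, with alignment overhead
ret_np = (df.values * ls).sum(axis=1)   # raw ndarray, same numbers

assert np.allclose(ret_pd.values, ret_np)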
Answer 3 (score: 0)
While not the best solution, this worked for me and brought the run time down to 4.8 seconds from the previous 52 seconds.
I ended up doing the grouping in pandas and then running the correlations with numpy.
groups = df.groupby(['course_id', 'activity_id'])
np_arr = []
for (cor_id, act_id), group in groups:
    np_arr.append([cor_id, act_id,
                   np.corrcoef(group.A.to_numpy(), group.perform.to_numpy())[0, 1],
                   np.corrcoef(group.B.to_numpy(), group.perform.to_numpy())[0, 1],
                   np.corrcoef(group.C.to_numpy(), group.perform.to_numpy())[0, 1],
                   np.corrcoef(group.D.to_numpy(), group.perform.to_numpy())[0, 1]])
df = pd.DataFrame(data=np.array(np_arr), columns=['course_id', 'activity_id', 'A', 'B', 'C', 'D'])
This effectively cut my run time, and I will type my variables with Cython to improve the speed further.