Correlating a pandas Series against a single vector

Time: 2017-11-25 09:58:13

Tags: performance pandas numpy dataframe

I have a DataFrame in which one column holds a list of arrays:

  import pandas as pd

  v = [1, 2, 3, 4, 5, 6, 7]
  v1 = [1, 0, 0, 0, 0, 0, 0]
  v2 = [0, 1, 0, 0, 1, 0, 0]
  v3 = [1, 1, 0, 0, 0, 0, 1]

  df = pd.DataFrame({'A': [v1, v2, v3]})

  print(df)

I want to compute pd.Series.corr for each row of df.A against the single vector v. I currently loop over df.A to do this, and it is slow.
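For reference, the loop described above might look like the following sketch. The asker's actual code is not shown, so the names target and loop_corr are illustrative assumptions; only pd.Series.corr comes from the question.

```python
import pandas as pd

v = [1, 2, 3, 4, 5, 6, 7]
df = pd.DataFrame({'A': [[1, 0, 0, 0, 0, 0, 0],
                         [0, 1, 0, 0, 1, 0, 0],
                         [1, 1, 0, 0, 0, 0, 1]]})

target = pd.Series(v)  # hypothetical name for the reference vector

# One pd.Series.corr call (Pearson by default) per row of df.A.
# Building a Series and calling corr inside a Python-level loop is
# what makes this slow when the DataFrame has many rows.
loop_corr = [pd.Series(row).corr(target) for row in df['A']]
df['loop_corr'] = loop_corr
```

This is the baseline that the vectorized answers below aim to replace.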

Current DataFrame (the expected output adds one correlation value per row):

                       A
0  [1, 0, 0, 0, 0, 0, 0]
1  [0, 1, 0, 0, 1, 0, 0]
2  [1, 1, 0, 0, 0, 0, 1]

2 answers:

Answer 0 (score: 2)

Use corrwith; if performance matters, though, Divakar's answer should be faster:

df['new'] = pd.DataFrame(df['A'].values.tolist()).corrwith(pd.Series(v), axis=1)
print (df)
                       A       new
0  [1, 0, 0, 0, 0, 0, 0] -0.612372
1  [0, 1, 0, 0, 1, 0, 0] -0.158114
2  [1, 1, 0, 0, 0, 0, 1] -0.288675

Answer 1 (score: 2)

Here is an approach that applies NumPy tools to the definition of correlation for performance, using corr2_coeff_rowwise -

import numpy as np
# corr2_coeff_rowwise is a helper that is not defined in this post
a = np.array(df.A.tolist()) # or np.vstack(df.A.values)
df['B'] = corr2_coeff_rowwise(a, np.asarray(v)[None])
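Since the body of corr2_coeff_rowwise does not appear in this thread, here is a minimal sketch of what a row-wise Pearson correlation built from the definition could look like. The function name rowwise_pearson and its implementation are assumptions for illustration, not the answerer's actual helper.

```python
import numpy as np

def rowwise_pearson(a, b):
    """Pearson correlation of each row of `a` with the matching row of `b`.

    Hypothetical stand-in for the helper used in the answer. `b` may have a
    single row, shape (1, n), in which case it is broadcast against every
    row of `a`.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Center each row around its own mean.
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    # Numerator: sum of products of the centered values, per row.
    num = (a_c * b_c).sum(axis=1)
    # Denominator: square root of the product of the sums of squares.
    den = np.sqrt((a_c ** 2).sum(axis=1) * (b_c ** 2).sum(axis=1))
    return num / den

v = [1, 2, 3, 4, 5, 6, 7]
a = np.array([[1, 0, 0, 0, 0, 0, 0],
              [0, 1, 0, 0, 1, 0, 0],
              [1, 1, 0, 0, 0, 0, 1]])

# [None] adds a leading axis, turning v into a (1, 7) array.
result = rowwise_pearson(a, np.asarray(v)[None])
```

On the sample data this reproduces the values in the corrwith output from Answer 0. A row with zero variance makes the denominator zero, so that row would come out as nan; this sketch does not handle that case specially.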

Runtime test -

Case #1: 1000 rows

In [59]: df = pd.DataFrame({'A': [np.random.randint(0,9,(7)) for i in range(1000)]})

In [60]: v = np.random.randint(0,9,(7)).tolist()

# @jezrael's soln
In [61]: %timeit df['new'] = pd.DataFrame(df['A'].values.tolist()).corrwith(pd.Series(v), axis=1)
10 loops, best of 3: 142 ms per loop

In [62]: %timeit df['B'] = corr2_coeff_rowwise(np.array(df.A.tolist()), np.asarray(v)[None])
1000 loops, best of 3: 461 µs per loop

Case #2: 10000 rows

In [63]: df = pd.DataFrame({'A': [np.random.randint(0,9,(7)) for i in range(10000)]})

In [64]: v = np.random.randint(0,9,(7)).tolist()

# @jezrael's soln
In [65]: %timeit df['new'] = pd.DataFrame(df['A'].values.tolist()).corrwith(pd.Series(v), axis=1)
1 loop, best of 3: 1.38 s per loop

In [66]: %timeit df['B'] = corr2_coeff_rowwise(np.array(df.A.tolist()), np.asarray(v)[None])
100 loops, best of 3: 3.05 ms per loop