我有一个DataFrame,其数组列表为一列。
void
输出:
import pandas as pd
v = [1, 2, 3, 4, 5, 6, 7]
v1 = [1, 0, 0, 0, 0, 0, 0]
v2 = [0, 1, 0, 0, 1, 0, 0]
v3 = [1, 1, 0, 0, 0, 0, 1]
df = pd.DataFrame({'A': [v1, v2, v3]})
print df
我想针对单个向量v为每行df.A做一个pd.Series.corr。 我目前正在对df.A进行循环并实现它。这很慢。
预期产出:
A
0 [1, 0, 0, 0, 0, 0, 0]
1 [0, 1, 0, 0, 1, 0, 0]
2 [1, 1, 0, 0, 0, 0, 1]
答案 0 :(得分:2)
使用corrwith
,但如果效果很重要,Divakar's anwer应该更快:
df['new'] = pd.DataFrame(df['A'].values.tolist()).corrwith(pd.Series(v), axis=1)
print (df)
A new
0 [1, 0, 0, 0, 0, 0, 0] -0.612372
1 [0, 1, 0, 0, 1, 0, 0] -0.158114
2 [1, 1, 0, 0, 0, 0, 1] -0.288675
答案 1 :(得分:2)
这是一个使用NumPy工具的相关性定义,用于corr2_coeff_rowwise
的性能 -
a = np.array(df.A.tolist()) # or np.vstack(df.A.values)
df['B'] = corr2_coeff_rowwise(a, np.asarray(v)[None])
运行时测试 -
案例#1:1000行
In [59]: df = pd.DataFrame({'A': [np.random.randint(0,9,(7)) for i in range(1000)]})
In [60]: v = np.random.randint(0,9,(7)).tolist()
# @jezrael's soln
In [61]: %timeit df['new'] = pd.DataFrame(df['A'].values.tolist()).corrwith(pd.Series(v), axis=1)
10 loops, best of 3: 142 ms per loop
In [62]: %timeit df['B'] = corr2_coeff_rowwise(np.array(df.A.tolist()), np.asarray(v)[None])
1000 loops, best of 3: 461 µs per loop
案例#2:10000行
In [63]: df = pd.DataFrame({'A': [np.random.randint(0,9,(7)) for i in range(10000)]})
In [64]: v = np.random.randint(0,9,(7)).tolist()
# @jezrael's soln
In [65]: %timeit df['new'] = pd.DataFrame(df['A'].values.tolist()).corrwith(pd.Series(v), axis=1)
1 loop, best of 3: 1.38 s per loop
In [66]: %timeit df['B'] = corr2_coeff_rowwise(np.array(df.A.tolist()), np.asarray(v)[None])
100 loops, best of 3: 3.05 ms per loop