I want to generate a DataFrame similar to the one returned by corr(), but with different contents.
For example, suppose I have this DataFrame:
import pandas as pd
pa = pd.DataFrame()
pa['john']=[2,3,4,5,6]
pa['june']=[4,6,7,8,2]
pa['kate']=[3,2,3,4,5]
Pandas has a built-in corr() function that produces a new DataFrame of correlations. So if I call pa.corr(), it returns:
          john      june      kate
john  1.000000 -0.131306  0.832050
june -0.131306  1.000000 -0.437014
kate  0.832050 -0.437014  1.000000
I want to generate a new DataFrame with the same row and column labels, but with different values, for example:
      john                           june                           kate
john  formula(john)*formula(john)    formula(june)*formula(john)    formula(kate)*formula(john)
june  formula(john)*formula(june)    formula(june)*formula(june)    formula(kate)*formula(june)
kate  formula(john)*formula(kate)    formula(june)*formula(kate)    formula(kate)*formula(kate)
where formula() performs a calculation on a single DataFrame column (for example, formula(pa['john'])). How can I do this?
Answer 0: (score: 1)
Here is one way to do it, though I am not sure it is the simplest:
# example function; any function of two columns that returns a scalar works
def formula(x, y):
    return sum(x * y)
import numpy as np
# create a list of tuples covering every pair of columns
l = [(x, y) for x in pa.columns for y in pa.columns]
#[('john', 'john'),
# ('john', 'june'),
# ('john', 'kate'),
# ('june', 'john'),
# ('june', 'june'),
# ('june', 'kate'),
# ('kate', 'john'),
# ('kate', 'june'),
# ('kate', 'kate')]
# create a long-format dataframe with one row per pair
# x = first element of the tuple = a column name of pa
# y = second element of the tuple = a column name of pa
# values = formula(pa[x], pa[y])
df = pd.DataFrame({'x': [el[0] for el in l],
                   'y': [el[1] for el in l],
                   'values': [formula(pa[x], pa[y]) for x, y in l]})
# x y values
#0 john john 90
#1 john june 106
#2 john kate 74
#3 june john 106
#4 june june 169
#5 june kate 87
#6 kate john 74
#7 kate june 87
#8 kate kate 63
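# pivot df to obtain the format you want
# (each (x, y) pair occurs once, so the aggregation only passes the value through;
#  the string 'sum' avoids the warning newer pandas versions raise for np.sum)
table = pd.pivot_table(df, values='values', index=['x'], columns=['y'], aggfunc='sum').reset_index()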
# y x john june kate
#0 john 90 106 74
#1 june 106 169 87
#2 kate 74 87 63
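
As a possible alternative to the long-format-then-pivot route above, the square table can also be built directly. The sketch below is only an illustration under stated assumptions: the helper names pairwise_table and outer_table are hypothetical (they are not pandas functions), and the lambdas stand in for whatever formula you actually need. pairwise_table covers the case used in this answer, where the function takes two columns; outer_table covers the case as literally written in the question, where formula() takes one column and the cell is the product of the two per-column results.

```python
import numpy as np
import pandas as pd

pa = pd.DataFrame({
    "john": [2, 3, 4, 5, 6],
    "june": [4, 6, 7, 8, 2],
    "kate": [3, 2, 3, 4, 5],
})


def pairwise_table(frame, func):
    """Hypothetical helper: apply func(col_a, col_b) to every pair of columns."""
    cols = frame.columns
    values = [[func(frame[r], frame[c]) for c in cols] for r in cols]
    # Rows and columns are both labelled with the original column names.
    return pd.DataFrame(values, index=cols, columns=cols)


def outer_table(frame, column_formula):
    """Hypothetical helper: cell (a, b) = column_formula(a) * column_formula(b)."""
    scores = frame.apply(column_formula)  # one scalar per column
    return pd.DataFrame(np.outer(scores, scores),
                        index=frame.columns, columns=frame.columns)


# Two-column function, same example formula as in the answer above.
pair_result = pairwise_table(pa, lambda x, y: (x * y).sum())

# One-column formula (here simply the column sum, chosen for illustration).
outer_result = outer_table(pa, lambda col: col.sum())

print(pair_result)
print(outer_result)
```

Unlike the pivot_table version, both helpers keep the column names as the index instead of moving them into a regular column with reset_index(), which is closer to the layout that corr() returns.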