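I want to compute t-statistics for many datasets, each of which has multiple columns.
To specify the columns, I use columns = df.columns
I then store the datasets in a list, conds = [a, b, c, d, e, f, g, h]
and I want to append the results to an empty list, results = []
This is the code I am using: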
    from scipy import stats

    results = []
    columns = df.columns
    conds = [a, b, c, d, e, f, g, h]
    for col in columns:
        for cond in conds:
            t_statistic, p_value = stats.ttest_1samp(conds[col], 0)
            results.append(t_statistic)
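The t-statistics for every column of every dataset all end up in one list.
What I would like to do, but am not sure how, is assign the column names and store each dataset's results in its own list/DataFrame.
Any suggestions would be very helpful!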
Answer 0 (score: 1)
Here is one approach that should give you what you need:
    import numpy as np
    import pandas as pd
    from scipy import stats

    # Generate sample data
    def data_gen():
        df = pd.DataFrame(np.random.rand(10, 10), columns=list('ABCDEFGHIJ'))
        return df

    a = data_gen()
    b = data_gen()
    c = data_gen()
    d = data_gen()
    e = data_gen()
    f = data_gen()
    g = data_gen()
    h = data_gen()
    df = a.copy()

    results = {}  # Initialize dictionary, one entry per data frame
    columns = df.columns
    conds = [a, b, c, d, e, f, g, h]
    df_names = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

    for name, cond in zip(df_names, conds):
        results[name] = []  # Initialize list for this data frame
        for col in columns:
            # Index the current data frame (cond), not the list (conds)
            t_statistic, p_value = stats.ttest_1samp(cond[col], 0)
            results[name].append(t_statistic)

    df_stats = pd.DataFrame.from_dict(results)
    df_stats.index.name = 'Columns'
    df_stats.columns.name = 'Data Frames'
    print(df_stats)
Output:

    Data Frames         a         b         c          d         e          f          g         h
    Columns
    0            4.868814  4.623735  4.238881   4.679973  5.450708   6.512495   6.080255  7.345525
    1            4.697972  6.964373  6.382984   6.880155  5.987408  10.999835   3.931329  4.771808
    2            2.965649  7.024299  4.748638  11.069944  4.176942   7.211100   5.258628  5.869208
    3            3.635906  4.797787  6.842129   4.891177  4.741151   6.576623  10.419799  5.335392
    4            4.834541  6.256189  4.916233   6.783839  5.716030   7.206449   5.924025  4.072350
    5            5.711664  6.880239  6.041098   6.373754  3.322898   4.781460   9.376661  5.085084
    6            6.808170  6.152167  7.111449   4.644709  7.156351   5.384771   6.964388  4.855696
    7            4.310228  4.564960  4.386858   3.877932  5.384289  15.098405   6.540945  5.633237
    8            4.462443  5.181235  5.844863   5.448389  4.600004   4.617082   5.472338  7.359407
    9            4.742538  6.812944  7.289546   5.858223  4.264142   5.728580   5.606259  6.936728
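
In the table above the rows are numbered 0 to 9 rather than labelled with the original column names, which the question also asked for. One possible variation, offered as a rough sketch rather than the definitive way to do it: build each dataset's results as a pandas Series indexed by the frame's own columns, so the labels carry through. The names frames, df_t and df_p and the small seeded sample data are made up for illustration; the sketch also assumes every dataset has the same columns and keeps the p-values, which the code above discards.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical sample data: three small frames sharing the same columns.
rng = np.random.default_rng(0)
frames = {
    name: pd.DataFrame(rng.random((10, 4)), columns=list("ABCD"))
    for name in ["a", "b", "c"]
}

t_stats = {}
p_values = {}
for name, frame in frames.items():
    # One call tests every column of the frame against a population mean of 0.
    result = stats.ttest_1samp(frame.to_numpy(), 0, axis=0)
    # Index by the frame's column names so the labels survive.
    t_stats[name] = pd.Series(result.statistic, index=frame.columns)
    p_values[name] = pd.Series(result.pvalue, index=frame.columns)

# Rows are labelled by the original column names, columns by dataset name.
df_t = pd.DataFrame(t_stats)
df_p = pd.DataFrame(p_values)
print(df_t)
```

A single dataset's results can then be pulled out with df_t['a'], and a single value with df_t.loc['C', 'b'].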