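I have a CSV file with 13 columns and 303 rows, and I have split the 303 rows into healthy patients and ill patients. I am now trying to get the average of each column for the healthy patients and for the ill patients so that I can compare and contrast the two groups. The end result I am after looks like the example below; the numbers from my CSV file should match the averages in this example, except where data is missing.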
Please enter a training file name: train.csv
Total Lines Processed: 303
Total Healthy Count: 164
Total Ill Count: 139
Averages of Healthy Patients:
[52.59, 0.56, 2.79, 129.25, 242.64, 0.14, 0.84, 158.38, 0.14, 0.59, 1.41, 0.27, 3.77, 0.00]
Averages of Ill Patients:
[56.63, 0.82, 3.59, 134.57, 251.47, 0.16, 1.17, 139.26, 0.55, 1.57, 1.83, 1.13, 5.80, 2.04]
Seperation Values are:
[54.61, 0.69, 3.19, 131.91, 247.06, 0.15, 1.00, 148.82, 0.34, 1.08, 1.62, 0.70, 4.79, 1.02]
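I still have a long way to go with my code; I am just looking for a simple way to get the averages for the patients. My current approach only handles column 13, but I need all 13 columns as shown above. Any help on how I should approach this would be greatly appreciated.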
import csv

# turn the csv file into a list of lists
with open('train.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    csv_data = list(reader)

# column 13 values of ill patients (value > 0)
i_list = []
for row in csv_data:
    if (row and int(row[13]) > 0):
        i_list.append(int(row[13]))

# column 13 values of healthy patients (value <= 0)
H_list = []
for row in csv_data:
    if (row and int(row[13]) <= 0):
        H_list.append(int(row[13]))

Icount = len(i_list)
IPavg = sum(i_list)/len(i_list)
Hcount = len(H_list)
HPavg = sum(H_list)/len(H_list)

file = open("train.csv")
numline = len(file.readlines())
print(numline)
print("Total amount of healthy patients " + str(Hcount))
print("Total amount of ill patients " + str(Icount))
print("Averages of healthy patients " + str(HPavg))
print("Averages of ill patients " + str(IPavg))
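My only idea is to repeat what I did to get the average of column 13 for every other column, but I do not know how to keep the healthy patients separate from the ill patients.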
Answer 0 (score: 2)
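If you want the average of each column, the easiest approach is to process all of them in a single pass while reading the file, which is not difficult. You did not say which version of Python you are using, but the following should work in both Python 2 and 3 (although it could be optimized for one or the other).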
import csv

NUMCOLS = 13    # number of leading columns to average
LABEL_COL = 13  # diagnosis column used in the question's code: > 0 means ill

with open('train.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')

    # initialize totals
    Icount = 0
    Hcount = 0
    H_col_totals = [0.0 for _ in range(NUMCOLS)]  # init to floating pt value for Py 2
    I_col_totals = [0.0 for _ in range(NUMCOLS)]  # init to floating pt value for Py 2

    # read and process file
    for row in reader:
        if row:  # non-blank line?
            # float() also accepts integer fields; assumes every field is numeric
            row = list(map(float, row))
            # classify the whole row once, by its diagnosis column
            if row[LABEL_COL] > 0:
                Icount += 1
                totals = I_col_totals
            else:
                Hcount += 1
                totals = H_col_totals
            # update running total for each column
            for col in range(NUMCOLS):
                totals[col] += row[col]

# compute average of data in each column
if Hcount < 1:  # avoid dividing by zero
    HPavgs = [0.0 for _ in range(NUMCOLS)]
else:
    HPavgs = [H_col_totals[col]/Hcount for col in range(NUMCOLS)]

if Icount < 1:  # avoid dividing by zero
    IPavgs = [0.0 for _ in range(NUMCOLS)]
else:
    IPavgs = [I_col_totals[col]/Icount for col in range(NUMCOLS)]

print("Total number of healthy patients: {}".format(Hcount))
print("Total number of ill patients: {}".format(Icount))
print("Averages of healthy patients: " +
      ", ".join(format(HPavgs[col], ".2f") for col in range(NUMCOLS)))
print("Averages of ill patients: " +
      ", ".join(format(IPavgs[col], ".2f") for col in range(NUMCOLS)))
Answer 1 (score: 1)
Why not use the pandas module?
It makes what you want much easier to accomplish.
In [42]: import pandas as pd
In [43]: import numpy as np
In [44]: df = pd.DataFrame(np.random.randn(10, 4))
In [45]: df
Out[45]:
          0         1         2         3
0  1.290657 -0.376132 -0.482188  1.117486
1 -0.620332 -0.247143  0.214548 -0.975472
2  1.803212 -0.073028  0.224965  0.069488
3 -0.249340  0.491075  0.083451  0.282813
4 -0.477317  0.059482  0.867047 -0.656830
5  0.117523  0.089099 -0.561758  0.459426
6 -0.173780 -0.066054 -0.943881 -0.301504
7  1.250235 -0.949350 -1.119425  1.054016
8  1.031764 -1.470245 -0.976696  0.579424
9  0.300025  1.141415  1.503518  1.418005
In [46]: df.mean()
Out[46]:
0    0.427265
1   -0.140088
2   -0.119042
3    0.304685
dtype: float64
In your case, you could try:
In [47]: df = pd.read_csv('yourfile.csv')
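To get separate averages for healthy and ill patients rather than one average per column, the rows can be grouped before calling `mean()`. The following is a minimal sketch, not something tested against the actual file: it assumes the CSV has no header row and that the last column is the diagnosis (greater than 0 meaning ill, as in the question's code). The in-memory sample data, `LABEL_COL`, and the `group` column name are made up for illustration.

```python
import io

import pandas as pd

LABEL_COL = 2  # hypothetical: index of the diagnosis column in this tiny sample

# made-up stand-in for the real file
sample = io.StringIO("50,1,0\n54,0,0\n60,1,2\n")
# header=None: treat the first line as data, not as column names
df = pd.read_csv(sample, header=None)

# label each row, then average every column within each group
df["group"] = (df[LABEL_COL] > 0).map({True: "ill", False: "healthy"})
means = df.groupby("group").mean()

healthy_avgs = means.loc["healthy"].tolist()
ill_avgs = means.loc["ill"].tolist()
# midpoint of the two group averages, column by column
separation = ((means.loc["healthy"] + means.loc["ill"]) / 2).tolist()

print(means.round(2))
print(separation)
```

With the real file, `pd.read_csv('train.csv', header=None)` would replace the sample and `LABEL_COL` would presumably be 13. `mean()` skips values that pandas has read as NaN by default, which may help with the missing data mentioned in the question.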