Question

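I have a CSV file with 13 columns and 303 rows. I have split the 303 rows into healthy patients and ill patients, and I am now trying to get the average of every column for each group so the healthy and ill patients can be compared and contrasted. The example below is the end result I am aiming for; the numbers in my CSV file should produce the same averages as in this example, except where data is missing.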

Please enter a training file name: train.csv
Total Lines Processed: 303
Total Healthy Count: 164
Total Ill Count: 139
Averages of Healthy Patients:
[52.59, 0.56, 2.79, 129.25, 242.64, 0.14, 0.84, 158.38, 0.14, 0.59, 1.41, 0.27, 3.77, 0.00]
Averages of Ill Patients:
[56.63, 0.82, 3.59, 134.57, 251.47, 0.16, 1.17, 139.26, 0.55, 1.57, 1.83, 1.13, 5.80, 2.04]
Seperation Values are:
[54.61, 0.69, 3.19, 131.91, 247.06, 0.15, 1.00, 148.82, 0.34, 1.08, 1.62, 0.70, 4.79, 1.02]

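I still have a long way to go with my code; I am just looking for a simple way to get the averages for each group of patients. My current approach only handles column 13 (row[13]), but I need all 13 columns as shown above. Any help with how I should approach this would be greatly appreciated.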

import csv
# turn the CSV file into a list of lists
with open('train.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    csv_data = list(reader)

i_list = []
for row in csv_data:
    if (row and int(row[13]) > 0):
        i_list.append(int(row[13]))
H_list = []
for row in csv_data:
    if (row and int(row[13]) <= 0):
        H_list.append(int(row[13]))

Icount = len(i_list)
IPavg = sum(i_list)/len(i_list)
Hcount = len(H_list)
HPavg = sum(H_list)/len(H_list)
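# count the lines, closing the file when done
with open("train.csv") as file:
    numline = len(file.readlines())

print(numline)
# Hcount is the healthy count and Icount the ill count
print("Total amount of healthy patients " + str(Hcount))
print("Total amount of ill patients " + str(Icount))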
print("Averages of healthy patients " + str(HPavg))
print("Averages of ill patients " + str(IPavg))

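My only idea is to repeat what I did to get the average of column 13 (row[13]) for every other column, but I don't know how to keep the healthy patients separate from the ill ones.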

Answer 1

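If you want the average of every column, the simplest approach is to process them all at once while reading the file, which is not difficult. You did not say which version of Python you are using, but the following should work in both Python 2 and Python 3 (although it could be optimized for one or the other).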

import csv

NUMCOLS = 13

with open('train.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    # initialize totals
    Icount = 0
    Hcount = 0
    H_col_totals = [0.0 for _ in range(NUMCOLS)]  # init to floating pt value for Py 2
    I_col_totals = [0.0 for _ in range(NUMCOLS)]  # init to floating pt value for Py 2
    # read and process file
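    for row in reader:
        if row:  # non-blank line?
            # float() accepts integer and decimal fields alike; a non-numeric
            # field (such as a missing-data marker) still raises ValueError
            row = list(map(float, row))
            # classify the whole row once by the diagnosis in row[13]
            # (the same test the question uses), then update that group's totals
            if row[13] > 0:
                Icount += 1
                for col in range(NUMCOLS):
                    I_col_totals[col] += row[col]
            else:
                Hcount += 1
                for col in range(NUMCOLS):
                    H_col_totals[col] += row[col]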

# compute average of data in each column
if Hcount < 1:  # avoid dividing by zero
    HPavgs = [0.0 for _ in range(NUMCOLS)]
else:
    HPavgs = [H_col_totals[col]/Hcount for col in range(NUMCOLS)]

if Icount < 1:  # avoid dividing by zero
    IPavgs = [0.0 for _ in range(NUMCOLS)]
else:
    IPavgs = [I_col_totals[col]/Icount for col in range(NUMCOLS)]

print("Total number of healthy patients: {}".format(Hcount))
print("Total number of ill patients: {}".format(Icount))
print("Averages of healthy patients: " +
      ", ".join(format(HPavgs[col], ".2f") for col in range(NUMCOLS)))
print("Averages of ill patients: " +
      ", ".join(format(IPavgs[col], ".2f") for col in range(NUMCOLS)))

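For experimenting without the real file, here is a minimal sketch of the same idea wrapped in a function. The function name group_column_averages, the "?" missing-value marker, and the four-column sample data are assumptions made for illustration and do not come from the question. Each row is classified by its diagnosis column, cells holding the missing marker are skipped, and the separation values are taken as the midpoint of the two averages, which is what the sample output in the question appears to show (for example, (52.59 + 56.63) / 2 = 54.61).

```python
import csv
import io


def group_column_averages(rows, label_col, missing="?"):
    """Return (counts, averages), each a dict keyed by "healthy" and "ill".

    rows      -- iterable of lists of strings, e.g. a csv.reader
    label_col -- index of the diagnosis column (0 = healthy, > 0 = ill)
    missing   -- assumed marker for a missing value; such cells are skipped
    """
    counts = {"healthy": 0, "ill": 0}
    sums = {"healthy": None, "ill": None}   # per-column running totals
    used = {"healthy": None, "ill": None}   # per-column count of usable cells
    for row in rows:
        if not row:                          # skip blank lines
            continue
        # assumes the diagnosis cell itself is never missing
        group = "ill" if float(row[label_col]) > 0 else "healthy"
        counts[group] += 1
        if sums[group] is None:              # first row seen for this group
            sums[group] = [0.0] * len(row)
            used[group] = [0] * len(row)
        for col, cell in enumerate(row):
            if cell.strip() == missing:
                continue
            sums[group][col] += float(cell)
            used[group][col] += 1
    averages = {}
    for group in counts:
        if sums[group] is None:              # no rows at all in this group
            averages[group] = []
        else:
            averages[group] = [s / n if n else 0.0
                               for s, n in zip(sums[group], used[group])]
    return counts, averages


# Made-up stand-in for train.csv: three value columns plus a diagnosis column.
sample = io.StringIO(
    "63,1,145,0\n"
    "67,1,160,2\n"
    "41,0,?,0\n"
    "\n"
    "57,0,120,1\n"
)
counts, averages = group_column_averages(csv.reader(sample), label_col=3)
# midpoint between the healthy and ill average of each column
separation = [(h + i) / 2
              for h, i in zip(averages["healthy"], averages["ill"])]

print(counts)
print(averages["healthy"])
print(averages["ill"])
print(separation)
```

With the real data you would pass csv.reader(csvfile) and label_col=13 instead of the sample. Dividing each column by the number of usable cells, rather than by the row count, keeps a missing value from dragging that column's average down.
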
Answer 2

Why not use the pandas module?

It makes what you want much easier to accomplish.

In [42]: import pandas as pd

In [43]: import numpy as np

In [44]: df = pd.DataFrame(np.random.randn(10, 4))

In [45]: df
Out[45]:
          0         1         2         3
0  1.290657 -0.376132 -0.482188  1.117486
1 -0.620332 -0.247143  0.214548 -0.975472
2  1.803212 -0.073028  0.224965  0.069488
3 -0.249340  0.491075  0.083451  0.282813
4 -0.477317  0.059482  0.867047 -0.656830
5  0.117523  0.089099 -0.561758  0.459426
6 -0.173780 -0.066054 -0.943881 -0.301504
7  1.250235 -0.949350 -1.119425  1.054016
8  1.031764 -1.470245 -0.976696  0.579424
9  0.300025  1.141415  1.503518  1.418005

In [46]: df.mean()
Out[46]:
0    0.427265
1   -0.140088
2   -0.119042
3    0.304685
dtype: float64

In your case, you could try:

In [47]: df = pd.read_csv('yourfile.csv')
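
To get separate averages for healthy and ill patients, one option is to group the rows by the diagnosis column before taking the mean. The following is only a sketch under a few assumptions that the question does not confirm: the file has no header row, missing entries are written as "?", and the diagnosis is the last column (0 = healthy, greater than 0 = ill). The in-memory sample stands in for the real file.

```python
import io

import pandas as pd

# Made-up stand-in for the real file; pass 'train.csv' to read_csv instead.
sample = io.StringIO(
    "63,1,145,0\n"
    "67,1,160,2\n"
    "41,0,?,0\n"
    "57,0,120,1\n"
)

# header=None: assume there is no header row.
# na_values="?": assume missing entries are written as "?".
df = pd.read_csv(sample, header=None, na_values="?")

label_col = df.columns[-1]  # assume the diagnosis is the last column
# label every row as "healthy" or "ill" based on the diagnosis value
group = (df[label_col] > 0).map({False: "healthy", True: "ill"}).rename("group")

# per-column mean within each group; NaN cells are skipped by default
group_means = df.groupby(group).mean()
healthy_avg = group_means.loc["healthy"]
ill_avg = group_means.loc["ill"]
# midpoint between the two group averages for each column
separation = (healthy_avg + ill_avg) / 2

print(group.value_counts())
print(group_means.round(2))
print(separation.round(2))
```

Because mean() ignores NaN by default, a column with a missing value is averaged over the rows that do have data.
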

How do I get the average of each column, rather than each row, from a CSV file?
