Finding the standard deviation from a CSV file using Python

Time: 2015-04-15 03:07:58

Tags: python csv

I have a CSV file named 'salaries.csv'. Its contents are as follows:

City,Job,Salary
Delhi,Doctors,500
Delhi,Lawyers,400
Delhi,Plumbers,100
London,Doctors,800
London,Lawyers,700
London,Plumbers,300
Tokyo,Doctors,900
Tokyo,Lawyers,800
Tokyo,Plumbers,400
Lawyers,Doctors,300
Lawyers,Lawyers,400
Lawyers,Plumbers,500
Hong Kong,Doctors,1800
Hong Kong,Lawyers,1100
Hong Kong,Plumbers,1000
Moscow,Doctors,300
Moscow,Lawyers,200
Moscow,Plumbers,100
Berlin,Doctors,800
Berlin,Plumbers,900
Paris,Doctors,900
Paris,Lawyers,800
Paris,Plumbers,500
Paris,Dog catchers,400

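I need to print the standard deviation of the salaries for each profession. This is an old version of Python, so I can't use the statistics module or numpy.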

from __future__ import with_statement
import math
import csv
with open("salaries.csv") as f:
  def average(f): return sum(f) * 1.0 / len(f)
variance = map(lambda x: (x - avg)**2, f)
standard_deviation = math.sqrt(average(variance))
print standard_deviation

Can someone help me? I'm new to Python.

Error : TypeError('argument 2 to map() must support iteration',)

The output should be:

Plumbers 311 Lawyers 286 Doctors 448
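
For reference, the expected figures match the population standard deviation, where the squared deviations are averaged over n rather than n - 1. A minimal sketch that checks this by hand for the Plumbers rows, with the salaries copied from the file above and a helper name (`population_std`) chosen here only for illustration:

```python
import math

# Plumbers salaries copied from salaries.csv above, in file order.
plumbers = [100, 300, 400, 500, 1000, 100, 900, 500]

def population_std(values):
    # Hypothetical helper: square root of the mean squared deviation.
    n = float(len(values))
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

result = population_std(plumbers)
print(int(result))  # prints 311
```

Dividing by n - 1 instead would give roughly 332 for the same rows, which does not match the expected output.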

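4 answers:

Answer 0 (score: 1)

Some notes:

  1. Python has built-in functions to get the length, minimum and maximum of a list of numbers (len, min and max, respectively).

  2. If you are using Python >= 3.4.0, there is a module named statistics that helps you compute the mean and standard deviation of a list.

  3. Create a file named stdev.py next to salaries.csv.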

    from statistics import mean, stdev
    f = open("salaries.csv", 'r')
    
    # Remove the first line City,Job,Salary
    f.readline()
    
    # Create the list of salaries 
    salaries = []
    for line in f.readlines():
      # After splitting the line, take the last element, remove extra spaces and cast it to int.
      value = int(line.split(',')[-1].strip())
      # Add the value to the salaries list.
      salaries.append(value)
    # min and max return the minimum and the maximum value of the list.
    # print is a function on Python 3, which the statistics module requires.
    print(min(salaries), max(salaries))
    print(mean(salaries), stdev(salaries))
    f.close()
    

    For Python 2.x:

    from __future__ import with_statement
    from math import sqrt
    with open('salaries.csv') as f:
      f.readline()
      # Create the list of salaries 
      salaries = []
      for line in f.readlines():
        value = int(line.split(',')[-1].strip())
        salaries.append(value)
      print min(salaries), max(salaries)   
      n = float(len(salaries))
      mean = sum(salaries)/n
      stdev = 0
      for value in salaries:
        stdev += (value - mean)**2
      stdev = sqrt(stdev/(n))
      print mean, stdev
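
Note that the two listings in this answer are not interchangeable: `statistics.stdev` is the sample standard deviation (it divides by n - 1), whereas the Python 2.x loop divides by n, which is what `statistics.pstdev` computes. A small sketch of the difference, assuming Python >= 3.4 and using only the Doctors salaries from the question as sample data:

```python
import math
from statistics import pstdev, stdev

# Doctors salaries copied from the question's salaries.csv.
doctors = [500, 800, 900, 300, 1800, 300, 800, 900]

n = len(doctors)
mean = sum(doctors) / float(n)
squared_deviations = sum((x - mean) ** 2 for x in doctors)

manual_population = math.sqrt(squared_deviations / n)    # divide by n
manual_sample = math.sqrt(squared_deviations / (n - 1))  # divide by n - 1

# Each pair agrees up to floating-point rounding.
print(manual_population, pstdev(doctors))
print(manual_sample, stdev(doctors))
```

The population figure is the one that truncates to the 448 the question expects for Doctors, so `pstdev` is probably the closer match for that output.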
    

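Answer 1 (score: 1)

You can build one dictionary per file that maps each profession to its list of salaries, then do the calculation at the end, using either your own functions or numpy.mean and numpy.std: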

>>> import csv
>>> from collections import defaultdict
>>> from numpy import std, mean
>>>
>>> profession_to_salaries = defaultdict(list)
>>>
>>> with open('salaries.csv', 'rb') as csvfile:
...   reader = csv.DictReader(csvfile)
...   for row in reader:
...     profession_to_salaries[row['Job']].append(float(row['Salary']))
...
>>> for profession, salaries in profession_to_salaries.items():
...   print profession, min(salaries), max(salaries), mean(salaries), std(salaries)
...
Plumbers 100.0 1000.0 475.0 311.24748995
Lawyers 200.0 1100.0 628.571428571 286.427680797
Dog catchers 400.0 400.0 400.0 0.0
Doctors 300.0 1800.0 787.5 448.434777866

For Python 2.5 (the with_statement future import used below is not available in 2.4):

>>> from __future__ import with_statement
>>> import csv
>>>
>>> def mean(lst):
...     return sum(lst) * 1.0 / len(lst)
...
>>> def variance(lst):
...     m = mean(lst)
...     return [ (x - m) ** 2 for x in lst ]
...
>>> def std(lst):
...     return mean(variance(lst))**0.5
...
>>> profession_to_salaries = {}
>>>
>>> with open('salaries.csv', 'rb') as csvfile:
...     reader = csv.DictReader(csvfile)
...     for row in reader:
...         profession = row['Job']
...         if not profession in profession_to_salaries:
...             profession_to_salaries[row['Job']] = []
...         profession_to_salaries[row['Job']].append(float(row['Salary']))
...
>>> for profession, salaries in profession_to_salaries.items():
...     print profession, min(salaries), max(salaries), mean(salaries), std(salaries)
...
Plumbers 100.0 1000.0 475.0 311.24748995
Lawyers 200.0 1100.0 628.571428571 286.427680797
Dog catchers 400.0 400.0 400.0 0.0
Doctors 300.0 1800.0 787.5 448.434777866
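
On Python 3 the same grouping idea needs two small changes: `print` is a function, and the `csv` module expects a text-mode file (its documentation recommends `open(..., newline='')`) rather than `'rb'`. A hedged sketch, assuming Python >= 3.4 and using an in-memory excerpt of the data (`SAMPLE` is a stand-in for the real file) so that it runs on its own:

```python
import csv
import io
from collections import defaultdict
from statistics import mean, pstdev

# Excerpt of salaries.csv; replace io.StringIO(SAMPLE) with
# open('salaries.csv', newline='') to read the real file.
SAMPLE = """City,Job,Salary
Delhi,Doctors,500
Delhi,Lawyers,400
Delhi,Plumbers,100
London,Doctors,800
London,Lawyers,700
London,Plumbers,300
"""

profession_to_salaries = defaultdict(list)
for row in csv.DictReader(io.StringIO(SAMPLE)):
    profession_to_salaries[row['Job']].append(float(row['Salary']))

# pstdev divides by n, like numpy.std with its default ddof=0.
stats = {}
for profession, salaries in profession_to_salaries.items():
    stats[profession] = (min(salaries), max(salaries),
                         mean(salaries), pstdev(salaries))
    print(profession, *stats[profession])
```

Here `statistics.pstdev` stands in for `numpy.std`, so no third-party package is needed.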

Answer 2 (score: 1)

To get the details for each profession, create a dictionary instead:

from __future__ import with_statement
import math

def get_stats(profession, salaries):   
  n = float(len(salaries))
  mean = sum(salaries)/n
  stdev = 0
  for value in salaries:
    stdev += (value - mean)**2
  stdev = math.sqrt(stdev/(n))
  print profession, min(salaries), max(salaries), mean, stdev

with open('salaries.csv') as f:
  f.readline()
  # Create the list of salaries 
  salaries = {} 
  for line in f.readlines():
    country, profession, value = line.split(',')
    value = int(value.strip())
    profession = profession.strip()
    if salaries.has_key(profession):
        salaries[profession].append(value)
    else:
        salaries[profession] = [value]
  for k,v in salaries.items():
    get_stats(k,v)  
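
`dict.has_key` was removed in Python 3; the `in` operator or `dict.setdefault` does the same job on both Python 2 and 3. A minimal sketch of the grouping step alone, with a few rows from the question inlined as a list (`lines`) instead of being read from the file:

```python
# A few data lines as they would come out of f.readlines().
lines = [
    "Delhi,Doctors,500\n",
    "Delhi,Lawyers,400\n",
    "London,Doctors,800\n",
    "Paris,Dog catchers,400\n",
]

salaries = {}
for line in lines:
    city, profession, value = line.strip().split(',')
    # setdefault returns the existing list, or stores and returns a new one.
    salaries.setdefault(profession, []).append(int(value))

print(sorted(salaries.items()))
```

The resulting dictionary can be passed to get_stats exactly as in the loop above.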

Answer 3 (score: 0)

In code:

from __future__ import with_statement
import math
import csv


def std_dev(v):
    # v is a list of (city, job, sal) tuples; sal is still a string here.
    n = float(len(v))  # float, so Python 2 does not truncate the divisions
    avg = sum([int(sal) for (city, job, sal) in v]) / n
    var = sum(map(lambda x: (int(x[-1]) - avg)**2, v)) / n
    return math.sqrt(var)

tups = []
with open("salaries.csv") as f:
    # The default delimiter (',') already splits each line into its fields.
    rdr = csv.reader(f)
    for row in rdr:
        tups.append(tuple(row))
tups = tups[1:]  # drop the header row

d = {}
for (city, job, sal) in tups:
    d.setdefault(job, []).append((city, job, sal))

for k, v in d.items():
    print k, std_dev(v)