Finding the median from a CSV file using Python

Time: 2015-04-14 18:08:29

Tags: python median

I have a CSV file named 'salaries.csv' with the following contents:

  

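City,Job,Salary
Delhi,Doctors,500
Delhi,Lawyers,400
Delhi,Plumbers,100
London,Doctors,800
London,Lawyers,700
London,Plumbers,300
Tokyo,Doctors,900
Tokyo,Lawyers,800
Tokyo,Plumbers,400
Lawyers,Doctors,300
Lawyers,Lawyers,400
Lawyers,Plumbers,500
Hong Kong,Doctors,1800
Hong Kong,Lawyers,1100
Hong Kong,Plumbers,1000
Moscow,Doctors,300
Moscow,Lawyers,200
Moscow,Plumbers,100
Berlin,Doctors,800
Berlin,Plumbers,900
Paris,Doctors,900
Paris,Lawyers,800
Paris,Plumbers,500
Paris,Dog catchers,400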

I need to print the median salary for each profession. I tried some code, but it raises an error.

My code is:

from StringIO import StringIO
import sqlite3
import csv
import operator #from operator import itemgetter, attrgetter

data = open('sal.csv', 'r').read()
string = ''.join(data)
f = StringIO(string)
reader = csv.reader(f)
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('''create table data (City text, Job text, Salary real)''')
conn.commit()
count = 0

for e in reader:
    if count==0:
        print ""
    else:
        e[0]=str(e[0])
        e[1]=str(e[1])
        e[2] = float(e[2])
        c.execute("""insert into data values (?,?,?)""", e)
        count=count+1
        conn.commit()

labels = []
counts = []
count = 0
c.execute('''select count(Salary),Job from data group by Job''')

for row in c:
      for i in row:
            if count==0:
               counts.append(i)
               count=count+1
           else:
                count=0
      labels.append(i)

c.execute('''select Salary,Job from data order by Job''')

count = 1
count1 = 1
temp = 0
pri = 0
lis = []

for row in c:
      lis.append(row)
for cons in counts:
      if cons%2 == 0:
         pri = cons/2
     else:
         pri = (cons+1)/2
     if count1 == 1:
        for li in lis:
              if count == pri:
                  print "Median is ",li
        count = count + 1
        count = 0
        temp = pri+cons
     else:
        for li in lis:
              if count == temp:
                  print "Median is",li
              count = count+1
              count = 0
              temp = temp + pri
       count1 = count1 + 1

However, it raises an error:

IndentationError('expected an indented block', ('', 28, 2, 'if count==0:\n'))

How do I fix the error?

3 Answers:

Answer 0: (score: 3)

You can use a defaultdict to collect all the salaries for each profession, and then take the median.

import csv
from collections import defaultdict

with open("C:/Users/jimenez/Desktop/a.csv","r") as f:
    d = defaultdict(list)
    reader = csv.reader(f)
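    next(reader)  # skip the header row
    for row in reader:
        # group each salary under its job title
        d[row[1]].append(float(row[2]))

for k,v in d.iteritems():
    # sorted(v)[len(v) // 2] is the middle value (the upper of the two middle values for an even count)
    print "{} median is {}".format(k,sorted(v)[len(v) // 2])
    print "{} average is {}".format(k,sum(v)/len(v))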

Output:

Plumbers median is 500.0
Plumbers average is 475.0
Lawyers median is 700.0
Lawyers average is 628.571428571
Dog catchers median is 400.0
Dog catchers average is 400.0
Doctors median is 800.0
Doctors average is 787.5
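
Indexing with `len(v) // 2` returns the upper of the two middle values when a group has an even number of entries, which is why Plumbers is reported as 500.0 above although the mean of its two middle values (400 and 500) is 450. A minimal Python 3 sketch that uses the standard library's `statistics.median` instead, which averages the two middle values; the helper name `median_by_job` and the inline sample rows are illustrative assumptions, not part of the original answer:

```python
import csv
import io
from collections import defaultdict
from statistics import median


def median_by_job(csv_file):
    """Return {job: median salary} for a file-like object with a City,Job,Salary header."""
    salaries = defaultdict(list)
    for row in csv.DictReader(csv_file):
        salaries[row["Job"].strip()].append(float(row["Salary"]))
    # statistics.median averages the two middle values for even-sized groups
    return {job: median(values) for job, values in salaries.items()}


# Hypothetical sample rows (a subset of the question's data) for demonstration
sample = io.StringIO(
    "City,Job,Salary\n"
    "Delhi,Plumbers,100\n"
    "London,Plumbers,300\n"
    "Tokyo,Plumbers,400\n"
    "Paris,Plumbers,500\n"
    "Paris,Dog catchers,400\n"
)
for job, value in sorted(median_by_job(sample).items()):
    print("{} median is {}".format(job, value))
```

With a real file, the same function could be called as `median_by_job(open("salaries.csv", newline=""))`, assuming the file keeps the header row shown in the question.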

Answer 1: (score: 1)

This is easy if you use pandas (http://pandas.pydata.org):

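import pandas as pd
# names= supplies the column names and assumes test.csv has no header row (add header=0 to replace one)
df = pd.read_csv('test.csv', names=['City', 'Job', 'Salary'])
# select Salary explicitly so the text column City is left out of the aggregation
df.groupby('Job')[['Salary']].median()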

#               Salary
# Job                 
# Doctors          800
# Dog catchers     400
# Lawyers          700
# Plumbers         450

If you want the mean instead of the median:

df.groupby('Job')[['Salary']].mean()

#                   Salary
# Job                     
# Doctors       787.500000
# Dog catchers  400.000000
# Lawyers       628.571429
# Plumbers      475.000000

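Answer 2: (score: 0)

If your problem is computing the median, then rather than inserting everything into an SQL database and scrambling it around, it is just a matter of reading all the rows, grouping the salaries by profession, and taking the median of each group. This reduces your hundred-line script to: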

import csv
professions = {}

with open("sal.csv") as data:
    reader = csv.reader(data)
    next(reader)  # skip the header row (City,Job,Salary)
    for city, profession, salary in reader:
        professions.setdefault(profession.strip(), []).append(int(salary.strip()))

for profession, salaries in sorted(professions.items()):
    # len(salaries) // 2 picks the upper middle value when the count is even
    print("{}: {}".format(profession, sorted(salaries)[len(salaries) // 2]))

(Adjust the index by 1 as needed to get the correct median from the sorted salaries.)
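
Whether the index needs adjusting comes down to whether a group has an odd or an even number of salaries. A minimal sketch of the usual definition, assuming the even case should average the two middle values; the function name `median_of` is hypothetical and not part of the original answer:

```python
def median_of(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle element
        return ordered[mid]
    # Even count: the mean of the two middle elements
    return (ordered[mid - 1] + ordered[mid]) / 2.0


# Plumbers' salaries from the question: eight values, so the median is
# the mean of the 4th and 5th sorted values (400 and 500)
print(median_of([100, 300, 400, 500, 1000, 100, 900, 500]))  # 450.0
# Lawyers' salaries: seven values, so the median is the 4th sorted value
print(median_of([400, 700, 800, 400, 1100, 200, 800]))  # 700
```

Calling `median_of(salaries)` in place of `sorted(salaries)[len(salaries) // 2]` in the loop above would make the even-count case explicit.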