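Please help! I've tried various approaches and packages to write a program that takes 4 inputs and, based on a csv file, returns writing-score statistics for the group matching that combination of inputs. This is my first project, so I'd appreciate any insights/tips/hints!
Here is a sample of the csv (200 rows in total):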
id gender ses schtyp prog write
70 male low public general 52
121 female middle public vocation 68
86 male high public general 33
141 male high public vocation 63
172 male middle public academic 47
113 male middle public academic 44
50 male middle public general 59
11 male middle public academic 34
84 male middle public general 57
48 male middle public academic 57
75 male middle public vocation 60
60 male middle public academic 57
Here is what I have so far:
import csv
import numpy
csv_file_object=csv.reader(open('scores.csv', 'rU')) #reads file
header=csv_file_object.next() #skips header
data=[] #loads data into array for processing
for row in csv_file_object:
    data.append(row)
data=numpy.array(data)
#asks for inputs
gender=raw_input('Enter gender [male/female]: ')
schtyp=raw_input('Enter school type [public/private]: ')
ses=raw_input('Enter socioeconomic status [low/middle/high]: ')
prog=raw_input('Enter program status [general/vocation/academic]: ')
#makes them lower case and strings
prog=str(prog.lower())
gender=str(gender.lower())
schtyp=str(schtyp.lower())
ses=str(ses.lower())
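What I'm missing is how to filter and get statistics for just one specific group. For example, say I enter male, public, middle and academic: I'd like to get the mean writing score for that subset. I tried the groupby function in pandas, but that only gave me statistics for broad groups (e.g. public vs. private). I also tried a pandas DataFrame, but I could only filter on one input and wasn't sure how to get the writing scores. Any hints would be greatly appreciated!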
Answer 0 (score: 1)
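I agree with Ramon: Pandas is definitely the way to go, and it has extraordinary filtering/subsetting capabilities once you get used to it. But it can be hard to wrap your head around at first (or at least it was for me!), so I dug some examples of the kind of subsetting you need out of some old code. The variable itu below is a Pandas DataFrame containing data for different countries.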
# Subsetting by using True/False:
subset = itu['CntryName'] == 'Albania' # returns True/False values
itu[subset] # returns 1x144 DataFrame of only data for Albania
itu[itu['CntryName'] == 'Albania'] # one-line command, equivalent to the above two lines
# Pandas has many built-in functions like .isin() to provide params to filter on
itu[itu.cntrycode.isin(['USA','FRA'])] # returns where itu['cntrycode'] is 'USA' or 'FRA'
itu[itu.year.isin([2000,2001,2002])] # Returns all of itu for only years 2000-2002
# Advanced subsetting can include logical operations:
itu[itu.cntrycode.isin(['USA','FRA']) & itu.year.isin([2000,2001,2002])] # Both of above at same time
# Use .loc with two elements to simultaneously select by row/index & column:
itu.loc['USA','CntryName']
itu.iloc[204,0]
itu.loc[['USA','BHS'], ['CntryName', 'Year']]
itu.iloc[[204, 13], [0, 1]]
# Can do many operations at once, but this reduces "readability" of the code
itu[itu.cntrycode.isin(['USA','FRA']) &
itu.year.isin([2000,2001,2002])].loc[:, ['cntrycode','cntryname','year','mpen','fpen']]
# Finally, if you're comfortable with using map() and list comprehensions,
# you can do some advanced subsetting that includes evaluations & functions
# to determine what elements you want to select from the whole, such as all
# countries whose name begins with "United":
criterion = itu['CntryName'].map(lambda x: x.startswith('United'))
itu[criterion]['CntryName'] # gives us UAE, UK, & US
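Applied to the data in the question, the same boolean-mask idea could look roughly like the sketch below. It assumes the file is whitespace-separated as in the sample shown, and it embeds only those twelve sample rows instead of reading scores.csv; the helper name mean_write is made up for illustration.

```python
import io
import pandas as pd

# The twelve sample rows from the question, whitespace-separated as shown there.
# With the real file, pd.read_csv('scores.csv', sep=r"\s+") would take this place
# (assuming the file uses the same layout).
SAMPLE = """id gender ses schtyp prog write
70 male low public general 52
121 female middle public vocation 68
86 male high public general 33
141 male high public vocation 63
172 male middle public academic 47
113 male middle public academic 44
50 male middle public general 59
11 male middle public academic 34
84 male middle public general 57
48 male middle public academic 57
75 male middle public vocation 60
60 male middle public academic 57
"""

scores = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")


def mean_write(df, gender, ses, schtyp, prog):
    """Mean 'write' score of the rows matching all four values (hypothetical helper)."""
    # Each comparison yields a True/False Series; & keeps rows where all are True.
    # The parentheses are required because & binds tighter than ==.
    mask = (
        (df["gender"] == gender)
        & (df["ses"] == ses)
        & (df["schtyp"] == schtyp)
        & (df["prog"] == prog)
    )
    # .loc[mask, "write"] selects the matching rows and only the score column.
    return df.loc[mask, "write"].mean()


result = mean_write(scores, "male", "middle", "public", "academic")
print(result)
```

If no row matches the four values, the selection is empty and .mean() gives NaN rather than raising an error, so it may be worth checking for that before printing.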
Answer 1 (score: 0)
Take a look at pandas. I think it will make short work of your csv parsing and give you the subsetting functionality you asked for...
import pandas as pd
data = pd.read_csv('fileName.txt', sep=r'\s+')  # whitespace-separated columns; delim_whitespace=True is deprecated in newer pandas
#get all of the male students
data[data['gender'] == 'male']
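Since the question mentions that groupby only gave statistics for broad groups, one possible way around that is to group on all four columns at once and then look up a single combination. This is only a sketch: it embeds the twelve sample rows from the question rather than reading the real file, and the choice of statistics is an assumption.

```python
import io
import pandas as pd

# The twelve sample rows from the question (the real file has 200 rows).
SAMPLE = """id gender ses schtyp prog write
70 male low public general 52
121 female middle public vocation 68
86 male high public general 33
141 male high public vocation 63
172 male middle public academic 47
113 male middle public academic 44
50 male middle public general 59
11 male middle public academic 34
84 male middle public general 57
48 male middle public academic 57
75 male middle public vocation 60
60 male middle public academic 57
"""

data = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")

# Group by all four inputs; each distinct combination becomes one row of `stats`,
# indexed by a 4-level MultiIndex in the order given in `keys`.
keys = ["gender", "ses", "schtyp", "prog"]
stats = data.groupby(keys)["write"].agg(["count", "mean", "min", "max"])

# Look up one combination; the tuple must follow the same order as `keys`.
wanted = ("male", "middle", "public", "academic")
group_stats = stats.loc[wanted, :]
print(group_stats)
```

A combination that does not occur in the data is simply absent from stats, so looking it up raises a KeyError; checking `wanted in stats.index` first avoids that.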