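I am trying to read a .csv file and extract specific columns so that I can output a table that essentially performs a "GROUP BY" on one column and aggregates several other columns of interest (much as you would in SQL), but I am not very familiar with how to do this easily in Python.
The csv file has the following format: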
age,education,balance,approved
30,primary,1850,yes
54,secondary,800,no
24,tertiary,240,yes
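I tried importing and reading the csv file to parse the three columns I care about, iterating over the rows to put them into three separate lists. I am not very familiar with the packages, or with how to get these into a 3-column data frame or matrix so that I can then iterate over them to mutate values or compute all of the aggregated output fields (see the expected result below).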
import csv

with open('loans.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    next(readCSV)  # skip the header row
    education = []
    balance = []
    loan_approved = []
    for row in readCSV:
        educat = row[1]
        bal = row[2]
        approve = row[3]
        education.append(educat)
        balance.append(bal)
        loan_approved.append(approve)

print(education)
print(balance)
print(loan_approved)
The output would be a four-row, 4x7 table (grouped by education level) with the following headers:
Education|#Applicants|Min Bal|Max Bal|#Approved|#Rejected|%Apps Approved
Primary ...
Secondary ...
Tertiary ...
Answer 0 (score: 1)
Switching to pandas seems much simpler. For example, you can read only the columns you need rather than all of them:
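import pandas as pd
# The file path is required; the column names must match the CSV header (approved, not loan_approved).
df = pd.read_csv('loans.csv', usecols=['education', 'balance', 'approved'])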
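To check what `usecols` does without needing the file on disk, here is a minimal sketch that feeds the three sample rows from the question through `io.StringIO`. The variable name `CSV_TEXT` is only for illustration, and the column names are the ones from the header of the sample file (`approved`, not `loan_approved`).

```python
import io

import pandas as pd

# Sample rows copied from the question; stands in for loans.csv.
CSV_TEXT = """age,education,balance,approved
30,primary,1850,yes
54,secondary,800,no
24,tertiary,240,yes
"""

# usecols keeps only the listed columns; "age" is never loaded.
df = pd.read_csv(io.StringIO(CSV_TEXT), usecols=["education", "balance", "approved"])

print(df.columns.tolist())
print(df.dtypes)
```

Note that pandas parses `balance` as an integer column here, so there is no need to convert the strings by hand as you would with `csv.reader`.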
Now, to group by education level, you can find all the unique entries in that column and split the rows by them:
groupby_education = {}
for level in list(set(df['education'])):
    # Keep only the rows whose education matches this level
    groupby_education[level] = df.loc[df['education'] == level]
print(groupby_education)
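If the goal is the summary table from the question rather than the raw groups, pandas' built-in `groupby` with named aggregation can probably do most of the work in one step. This is only a sketch: the output column names (`applicants`, `min_bal`, `max_bal`, and so on) and the helper column `is_approved` are made up for illustration, and the three sample rows from the question are inlined through `io.StringIO` in place of `loans.csv`.

```python
import io

import pandas as pd

# Sample rows copied from the question; stands in for loans.csv.
CSV_TEXT = """age,education,balance,approved
30,primary,1850,yes
54,secondary,800,no
24,tertiary,240,yes
"""

df = pd.read_csv(io.StringIO(CSV_TEXT), usecols=["education", "balance", "approved"])

# Hypothetical helper column: True where the loan was approved,
# so approvals can be summed and averaged.
df["is_approved"] = df["approved"].eq("yes")

# One row per education level, one named aggregation per output column.
summary = df.groupby("education").agg(
    applicants=("balance", "size"),
    min_bal=("balance", "min"),
    max_bal=("balance", "max"),
    approved=("is_approved", "sum"),
    pct_approved=("is_approved", "mean"),
)
summary["rejected"] = summary["applicants"] - summary["approved"]
summary["pct_approved"] = summary["pct_approved"] * 100  # fraction -> percent

print(summary)
```

With the real file, replacing `io.StringIO(CSV_TEXT)` with `'loans.csv'` should be the only change needed, assuming the header matches the sample.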
I hope this helps. Let me know if you still need help. Cheers!