I'm looking for a standard-library Python solution (no pandas, no numpy) to the following problem.
I have some large CSV files. For example:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |Status |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |COMPLETE|
11 | D... | 132 W.. | Mo.. | AL... | 3.. | 334.. | 02023 | 10011 |WITHDRAWN|
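First, not all files have the same headers, and the columns are not necessarily in the same order. For example, the "Status" header may be 'Status', 'Case_status', or 'CASE', and the status value of a single record may be 'COMPLETE', 'COMPLETE-WITHDRAWN', 'INCOMPLETE', and so on. What I want to do is keep only the rows whose status is 'COMPLETE', and then list the top 10 states by how many 'COMPLETE' records each state has.
I can do this once I know the column numbers, and I found some help on filtering rows with a lambda, but I'm having trouble finding a simple Python solution for selecting columns by name.
A simple solution for a .csv file when the names and order of the columns are known: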
from __future__ import division  # must come first; only needed on Python 2
import csv
from collections import Counter

with open('input.csv') as f:
    reader = csv.reader(f, delimiter=';')
    next(reader)  # skip header
    # selecting columns by number, but I want to select them by name
    occurrence = Counter(row[4] for row in reader if row[9] == 'COMPLETE')

print(occurrence)
total = sum(occurrence.values())
print(total)

# print the top 10 records
with open('output.txt', 'w') as f:
    writer = csv.writer(f, delimiter=';')
    writer.writerow(['Top 10 States', 'Number of complete cases', 'PERCENTAGE'])
    for src, cnt in occurrence.most_common(10):
        percent = cnt / total
        writer.writerow([src, cnt, "{0:.1%}".format(percent)])
So the output looks like:
Top 10 States;Number of complete cases;PERCENTAGE
Wisconsin;11;16.9%
California;10;15.4%
Arizona;9;13.8%
Texas;8;12.3%
Kentucky;7;10.8%
Maine;6;9.2%
Oklahoma;5;7.7%
Ohio;4;6.2%
New York;3;4.6%
Florida;2;3.1%
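But this doesn't help, because I first have to select the columns by name and only then filter on the status.
A more complete attempt at the filtering: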
from __future__ import division
import csv
import itertools
from collections import Counter

f = open('input.csv')
reader = csv.reader(f, delimiter=';')
next(reader)  # skip header
# same as: if row[2] == 'COMPLETE'
# (itertools.ifilter is Python 2 only; on Python 3 use the built-in filter)
filtered_complete = itertools.ifilter(lambda x: 'COMPLETE' == x[2], reader)
occurrence = Counter(row[24] for row in filtered_complete)
f.close()
print(occurrence)  # check totals
total = sum(occurrence.values())
print(total)  # check totals

with open('output.txt', 'w') as f:
    writer = csv.writer(f, delimiter=';')
    writer.writerow(['Top 10 States', 'Number of complete cases', 'PERCENTAGE'])
    for src, cnt in occurrence.most_common(10):
        percent = cnt / total
        writer.writerow([src, cnt, "{0:.1%}".format(percent)])
But this doesn't help with selecting columns by name, and I get a KeyError when I try to use
itertools.ifilter(lambda x: 'Complete' == 'Status', reader)
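Note that the lambda above compares two string literals, so it is False for every row and never inspects the row itself. A KeyError is more typical of indexing a DictReader row with a name that is not in the header. The sketch below illustrates both behaviours on Python 3, using a small in-memory sample (the sample data and variable names are made up for illustration):

```python
import csv
import io

# Hypothetical in-memory sample standing in for input.csv
sample = "ID;Status;State\n1;COMPLETE;AL\n2;WITHDRAWN;TX\n"
rows = list(csv.DictReader(io.StringIO(sample), delimiter=';'))

# Comparing two literals ignores the row, so nothing passes the filter
always_false = list(filter(lambda x: 'Complete' == 'Status', rows))

# Indexing the row by its column name is what actually filters
by_name = list(filter(lambda x: x['Status'] == 'COMPLETE', rows))

# A name that is not in the header raises KeyError
try:
    rows[0]['Case_status']
    missing_key_raised = False
except KeyError:
    missing_key_raised = True
```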
Any ideas? Thanks!
Answer 0 (score: 1)
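You can use csv.DictReader to read in the column names. It reads each row as a dict whose keys are the column names and whose values are the row's values. So, assuming the column is named Status, a filter that uses the column name would look like this: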
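import csv
import itertools
f = open('input.csv')
reader = csv.DictReader(f, delimiter=';')
# itertools.ifilter is Python 2 only; on Python 3 the built-in filter is already lazy
filtered_complete = itertools.ifilter(lambda x: x['Status'] == 'COMPLETE', reader)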
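On Python 3, `itertools.ifilter` no longer exists; the built-in `filter` or a generator expression does the same job lazily. Here is a minimal sketch of the same filter, assuming a `;`-delimited file with a `Status` and a `State` column. The helper name `complete_rows` and the sample data are illustrative and not part of the original answer:

```python
import csv
import io

def complete_rows(fileobj, status_col='Status', delimiter=';'):
    """Return a lazy iterator over the rows whose status column is 'COMPLETE'."""
    reader = csv.DictReader(fileobj, delimiter=delimiter)
    return (row for row in reader if row[status_col] == 'COMPLETE')

# Hypothetical sample standing in for open('input.csv', newline='')
sample = io.StringIO(
    "ID;State;Status\n"
    "10;AL;COMPLETE\n"
    "11;AL;WITHDRAWN\n"
    "12;TX;COMPLETE\n"
)
states = [row['State'] for row in complete_rows(sample)]
```

Because the generator reads one row at a time, this should also be workable for large files.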
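To handle several possible column names, if you know that the name will contain case or status, you can write a regular expression to find the column name.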
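import csv
import itertools
import re

f = open('input.csv')
reader = csv.DictReader(f, delimiter=';')
# Assuming that there is at least one column with "case" or "status" in its name
col_names = reader.fieldnames  # header row only; avoids loading the whole file
# IGNORECASE so that 'Status', 'Case_status' and 'CASE' all match
col_name_regex = re.compile(r'(case|status)', re.IGNORECASE)
status_col_name = list(filter(col_name_regex.search, col_names))[0]
# itertools.ifilter is Python 2 only; on Python 3 use the built-in filter
filtered_complete = itertools.ifilter(lambda x: x[status_col_name] == 'COMPLETE', reader)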
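Since the header variants in the question differ in case ('Status', 'Case_status', 'CASE'), the lookup is more robust if it ignores case and fails loudly when nothing matches. Here is a sketch of that lookup as a small helper. The name `find_column` and the sample headers are hypothetical, and `DictReader.fieldnames` is used so that only the header row is read:

```python
import csv
import io
import re

def find_column(fieldnames, pattern=r'case|status'):
    """Return the first header that matches pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    for name in fieldnames:
        if regex.search(name):
            return name
    raise KeyError('no column matching %r in %r' % (pattern, fieldnames))

# Hypothetical headers, one per file variant mentioned in the question
headers = ["ID;State;Status", "ID;Case_status;State", "CASE;ID;State"]
found = []
for header in headers:
    reader = csv.DictReader(io.StringIO(header + "\n"), delimiter=';')
    found.append(find_column(reader.fieldnames))
```

Note that 'State' contains neither "case" nor "status", so it is not picked up by mistake.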
Disclaimer: this code is untested. I just wrote it in this editor.
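Combining the regex lookup with the Counter logic from the question, an end-to-end version for Python 3 might look like the sketch below. It assumes that exactly one header contains "case" or "status" and that the state column is literally named `State`. The function name `top_states` and the sample rows are illustrative only:

```python
import csv
import io
import re
from collections import Counter

def top_states(fileobj, n=10, delimiter=';'):
    """Return (state, count, share) for the n states with the most COMPLETE rows."""
    reader = csv.DictReader(fileobj, delimiter=delimiter)
    status_re = re.compile(r'case|status', re.IGNORECASE)
    # Assumption: one header matches, and the state column is named 'State'
    status_col = next(name for name in reader.fieldnames if status_re.search(name))
    counts = Counter(row['State'] for row in reader if row[status_col] == 'COMPLETE')
    total = sum(counts.values())
    return [(state, cnt, cnt / total) for state, cnt in counts.most_common(n)]

# Hypothetical sample standing in for open('input.csv', newline='')
sample = io.StringIO(
    "ID;State;Case_status\n"
    "1;AL;COMPLETE\n"
    "2;AL;COMPLETE\n"
    "3;TX;COMPLETE\n"
    "4;TX;INCOMPLETE\n"
    "5;WI;COMPLETE-WITHDRAWN\n"
)
result = top_states(sample)
lines = ["{0};{1};{2:.1%}".format(*r) for r in result]
```

`Counter.most_common(n)` returns the entries sorted by count, which is what gives the "top 10" ordering; plain `items()` does not guarantee it.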